TABLE B.2
Percentiles of the t Distribution.

Entry is t(A; ν) where P{t(ν) ≤ t(A; ν)} = A

                                A
  ν     .60     .70     .80     .85     .90     .95    .975

  1   0.325   0.727   1.376   1.963   3.078   6.314  12.706
  2   0.289   0.617   1.061   1.386   1.886   2.920   4.303
  3   0.277   0.584   0.978   1.250   1.638   2.353   3.182
  4   0.271   0.569   0.941   1.190   1.533   2.132   2.776
  5   0.267   0.559   0.920   1.156   1.476   2.015   2.571

  6   0.265   0.553   0.906   1.134   1.440   1.943   2.447
  7   0.263   0.549   0.896   1.119   1.415   1.895   2.365
  8   0.262   0.546   0.889   1.108   1.397   1.860   2.306
  9   0.261   0.543   0.883   1.100   1.383   1.833   2.262
 10   0.260   0.542   0.879   1.093   1.372   1.812   2.228

 11   0.260   0.540   0.876   1.088   1.363   1.796   2.201
 12   0.259   0.539   0.873   1.083   1.356   1.782   2.179
 13   0.259   0.537   0.870   1.079   1.350   1.771   2.160
 14   0.258   0.537   0.868   1.076   1.345   1.761   2.145
 15   0.258   0.536   0.866   1.074   1.341   1.753   2.131

 16   0.258   0.535   0.865   1.071   1.337   1.746   2.120
 17   0.257   0.534   0.863   1.069   1.333   1.740   2.110
 18   0.257   0.534   0.862   1.067   1.330   1.734   2.101
 19   0.257   0.533   0.861   1.066   1.328   1.729   2.093
 20   0.257   0.533   0.860   1.064   1.325   1.725   2.086

 21   0.257   0.532   0.859   1.063   1.323   1.721   2.080
 22   0.256   0.532   0.858   1.061   1.321   1.717   2.074
 23   0.256   0.532   0.858   1.060   1.319   1.714   2.069
 24   0.256   0.531   0.857   1.059   1.318   1.711   2.064
 25   0.256   0.531   0.856   1.058   1.316   1.708   2.060

 26   0.256   0.531   0.856   1.058   1.315   1.706   2.056
 27   0.256   0.531   0.855   1.057   1.314   1.703   2.052
 28   0.256   0.530   0.855   1.056   1.313   1.701   2.048
 29   0.256   0.530   0.854   1.055   1.311   1.699   2.045
 30   0.256   0.530   0.854   1.055   1.310   1.697   2.042

 40   0.255   0.529   0.851   1.050   1.303   1.684   2.021
 60   0.254   0.527   0.848   1.045   1.296   1.671   2.000
120   0.254   0.526   0.845   1.041   1.289   1.658   1.980
  ∞   0.253   0.524   0.842   1.036   1.282   1.645   1.960

TABLE B.2 (concluded)
Percentiles of the t Distribution.

                                     A
  ν      .98     .985      .99    .9925     .995    .9975    .9995

  1   15.895   21.205   31.821   42.434   63.657  127.322  636.590
  2    4.849    5.643    6.965    8.073    9.925   14.089   31.598
  3    3.482    3.896    4.541    5.047    5.841    7.453   12.924
  4    2.999    3.298    3.747    4.088    4.604    5.598    8.610
  5    2.757    3.003    3.365    3.634    4.032    4.773    6.869

  6    2.612    2.829    3.143    3.372    3.707    4.317    5.959
  7    2.517    2.715    2.998    3.203    3.499    4.029    5.408
  8    2.449    2.634    2.896    3.085    3.355    3.833    5.041
  9    2.398    2.574    2.821    2.998    3.250    3.690    4.781
 10    2.359    2.527    2.764    2.932    3.169    3.581    4.587

 11    2.328    2.491    2.718    2.879    3.106    3.497    4.437
 12    2.303    2.461    2.681    2.836    3.055    3.428    4.318
 13    2.282    2.436    2.650    2.801    3.012    3.372    4.221
 14    2.264    2.415    2.624    2.771    2.977    3.326    4.140
 15    2.249    2.397    2.602    2.746    2.947    3.286    4.073

 16    2.235    2.382    2.583    2.724    2.921    3.252    4.015
 17    2.224    2.368    2.567    2.706    2.898    3.222    3.965
 18    2.214    2.356    2.552    2.689    2.878    3.197    3.922
 19    2.205    2.346    2.539    2.674    2.861    3.174    3.883
 20    2.197    2.336    2.528    2.661    2.845    3.153    3.849

 21    2.189    2.328    2.518    2.649    2.831    3.135    3.819
 22    2.183    2.320    2.508    2.639    2.819    3.119    3.792
 23    2.177    2.313    2.500    2.629    2.807    3.104    3.768
 24    2.172    2.307    2.492    2.620    2.797    3.091    3.745
 25    2.167    2.301    2.485    2.612    2.787    3.078    3.725

 26    2.162    2.296    2.479    2.605    2.779    3.067    3.707
 27    2.158    2.291    2.473    2.598    2.771    3.057    3.690
 28    2.154    2.286    2.467    2.592    2.763    3.047    3.674
 29    2.150    2.282    2.462    2.586    2.756    3.038    3.659
 30    2.147    2.278    2.457    2.581    2.750    3.030    3.646

 40    2.123    2.250    2.423    2.542    2.704    2.971    3.551
 60    2.099    2.223    2.390    2.504    2.660    2.915    3.460
120    2.076    2.196    2.358    2.468    2.617    2.860    3.373
  ∞    2.054    2.170    2.326    2.432    2.576    2.807    3.291
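
The entries of Table B.2 can also be computed directly from the definition P{t(ν) ≤ t(A; ν)} = A. The following is a minimal sketch, not the method used to prepare the table: it relies only on the Python standard library, and the function names `t_cdf_angle` and `t_percentile` are hypothetical names chosen for illustration. It assumes ν ≥ 1 and uses the substitution t = sqrt(ν) tan(θ), which turns the t density into a constant times cos(θ)^(ν-1) on a bounded interval, so that Simpson's rule and bisection are sufficient.

```python
import math


def t_cdf_angle(theta, nu, steps=1000):
    """Return P{t(nu) <= sqrt(nu) * tan(theta)} for 0 <= theta < pi/2.

    After the substitution t = sqrt(nu) * tan(theta), the t density becomes
    k * cos(theta) ** (nu - 1), which is integrated here by Simpson's rule.
    """
    k = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(math.pi)
    h = theta / steps  # steps must be even for Simpson's rule
    total = 1.0 + math.cos(theta) ** (nu - 1)  # endpoint values; cos(0) ** (nu - 1) = 1
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * math.cos(i * h) ** (nu - 1)
    return 0.5 + k * total * h / 3


def t_percentile(a, nu, tol=1e-12):
    """Return t(A; nu), the value with P{t(nu) <= t(A; nu)} = A (assumes nu >= 1)."""
    if not 0.0 < a < 1.0 or nu < 1:
        raise ValueError("requires 0 < A < 1 and nu >= 1")
    if a < 0.5:
        # The t distribution is symmetric about zero.
        return -t_percentile(1.0 - a, nu, tol)
    lo, hi = 0.0, math.pi / 2
    while hi - lo > tol:  # bisection on the angle theta
        mid = (lo + hi) / 2
        if t_cdf_angle(mid, nu) < a:
            lo = mid
        else:
            hi = mid
    return math.sqrt(nu) * math.tan((lo + hi) / 2)


if __name__ == "__main__":
    for nu in (1, 10, 30, 120):
        print(nu, [f"{t_percentile(a, nu):.3f}" for a in (0.90, 0.95, 0.975)])
```

For example, `t_percentile(0.975, 10)` agrees with the tabled value 2.228 to three decimal places. The ν = ∞ row is not covered by this sketch; those entries are the corresponding percentiles of the standard normal distribution.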







Applied Linear 
Statistical Models 



The McGraw-Hill/Irwin Series: Operations and Decision Sciences 


BUSINESS STATISTICS

Alwan
Statistical Process Analysis
First Edition

Aczel and Sounderpandian
Complete Business Statistics
Fifth Edition

Bowerman and O’Connell
Business Statistics in Practice
Third Edition

Bryant and Smith
Practical Data Analysis:
Case Studies in Business Statistics,
Volumes I, II, III
Second Edition

Cooper and Schindler
Business Research Methods
Eighth Edition

Delurgio
Forecasting Principles
and Applications
First Edition

Doane
LearningStats
First Edition

Doane, Mathieson, and Tracy
Visual Statistics
Second Edition, 2.0

Gitlow, Oppenheim, and Oppenheim
Quality Management: Tools and
Methods for Improvement
Third Edition

Lind, Marchal, and Wathen
Basic Statistics for Business
and Economics
Fourth Edition

Lind, Marchal, and Mason
Statistical Techniques in Business
and Economics
Eleventh Edition

Merchant, Goffinet, and Koehler
Basic Statistics Using Excel
for Office 2000
Third Edition

Sahai and Khurshid
Pocket Dictionary of Statistics
First Edition

Siegel
Practical Business Statistics
Fifth Edition

Wilson, Keating, and John Galt
Solutions, Inc.
Business Forecasting
Fourth Edition

Zagorsky
Business Information
First Edition


QUANTITATIVE METHODS
AND MANAGEMENT
SCIENCE

Bodily, Carraway, Frey, Pfeifer
Quantitative Business Analysis:
Text and Cases
First Edition

Bonini, Hausman, and Bierman
Quantitative Analysis for Business
Decisions
Ninth Edition

Hillier and Hillier
Introduction to Management Science:
A Modeling and Case Studies Approach
with Spreadsheets
Second Edition





Applied Linear 
Statistical Models 


Fifth Edition 


Michael H. Kutner 

Emory University 

Christopher J. Nachtsheim 

University of Minnesota 

John Neter 

University of Georgia 

William Li 

University of Minnesota 


McGraw-Hill
Irwin

Boston Burr Ridge, IL Dubuque, IA Madison, WI New York San Francisco St. Louis
Bangkok Bogota Caracas Kuala Lumpur Lisbon London Madrid Mexico City 
Milan Montreal New Delhi Santiago Seoul Singapore Sydney Taipei Toronto 



The McGraw-Hill Companies

APPLIED LINEAR STATISTICAL MODELS

Published by McGraw-Hill/Irwin, a business unit of The McGraw-Hill Companies, Inc., 1221 Avenue of the
Americas, New York, NY, 10020. Copyright © 2005, 1996, 1990, 1983, 1974 by The McGraw-Hill Companies,
Inc. All rights reserved. No part of this publication may be reproduced or distributed in any form or by any
means, or stored in a database or retrieval system, without the prior written consent of The McGraw-Hill
Companies, Inc., including, but not limited to, in any network or other electronic storage or transmission, or
broadcast for distance learning.

Some ancillaries, including electronic and print components, may not be available to customers outside the 
United States. 

This book is printed on acid-free paper.

1234567890 DOC/DOC 0987654

ISBN 0-07-238688-6

Editorial director: Brent Gordon

Executive editor: Richard T. Hercher, Jr.

Editorial assistant: Lee Stone 

Senior marketing manager: Douglas Reiner 

Media producer: Elizabeth Mavetz 

Project manager: Jim Labeots 

Production supervisor: Gina Hangos 

Lead designer: Pam Verros

Supplement producer: Matthew Perry 

Senior digital content specialist: Brian Nacik 

Cover design: Kiem Pohl 

Typeface: 10/12 Times Roman 

Compositor: Interactive Composition Corporation 

Printer: R. R. Donnelley


Library of Congress Cataloging-in-Publication Data 


Kutner, Michael H. 

Applied linear statistical models. -- 5th ed. / Michael H. Kutner ... [et al.].

p. cm. -- (McGraw-Hill/Irwin series Operations and decision sciences)

Rev. ed. of: Applied linear regression models. 4th ed. c2004.

Includes bibliographical references and index.

ISBN 0-07-238688-6 (acid-free paper)

1. Regression analysis. 2. Mathematical statistics. I. Kutner, Michael H. Applied linear
regression models. II. Title. III. Series.

QA278.2.K87 2005 

519.536 — dc22 2004052447 

www.mhhe.com



To 

Nancy, Michelle, Allison, 

Maureen, Abigael, Andrew, Henry G.,
Dorothy, Ron, David,

Dezhong, Chenghua, Xu




Preface 


Linear statistical models for regression, analysis of variance, and experimental design are
widely used today in business administration, economics, engineering, and the social, health,
and biological sciences. Successful applications of these models require a sound understanding
of both the underlying theory and the practical problems that are encountered in using
the models in real-life situations. While Applied Linear Statistical Models, Fifth Edition, is 
basically an applied book, it seeks to blend theory and applications effectively, avoiding the 
extremes of presenting theory in isolation and of giving elements of applications without 
the needed understanding of the theoretical foundations. 

The fifth edition differs from the fourth in a number of important respects. 


In the area of regression analysis (Parts I-III): 

1. We have reorganized the chapters for better clarity and flow of topics. Material from 
the old Chapter 15 on normal correlation models has been integrated throughout the 
text where appropriate. Much of the material is now found in an expanded Chapter 
2, which focuses on inference in regression analysis. Material from the old Chapter 7 
pertaining to polynomial and interaction regression models and from old Chapter 11 
on quantitative predictors has been integrated into a new Chapter 8 called, “Models 
for Quantitative and Qualitative Predictors.” Material on model validation from old 
Chapter 10 is now fully integrated with updated material on model selection in a new 
Chapter 9 entitled, “Building the Regression Model I: Model Selection and Validation.” 

2. We have added material on important techniques for data mining, including regression 
trees and neural network models in Chapters 11 and 13, respectively. 

3. The chapter on logistic regression (Chapter 14) has been extensively revised and 
expanded to include a more thorough treatment of logistic, probit, and complementary
log-log models, logistic regression residuals, model selection, model assessment,
logistic regression diagnostics, and goodness of fit tests. We have also developed new
material on polytomous (multicategory) nominal logistic regression models and polytomous
ordinal logistic regression models.

4. We have expanded the discussion of model selection methods and criteria. The Akaike 
information criterion and Schwarz Bayesian criterion have been added, and a greater 
emphasis is placed on the use of cross-validation for model selection and validation. 


In the areas pertaining to the design and analysis of experimental and observational studies 
(Parts IV-VI): 

5. In the previous edition, Chapters 16 through 25 emphasized the analysis of variance,
and the design of experiments was not encountered formally until Chapter 26. We
have completely reorganized Parts IV-VI, emphasizing the design of experimental and
observational studies from the start. In a new Chapter 15, we provide an overview of 
the basic concepts and planning approaches used in the design of experimental and 
observational studies, drawing in part from material from old Chapters 16, 26, and 
27. Fundamental concepts of experimental design, including the basic types of factors,
treatments, experimental units, randomization, and blocking are described in detail.
This is followed by an overview of standard experimental designs, as well as the basic 
types of observational studies, including cross-sectional, retrospective, and prospective 
studies. Each of the design topics introduced in Chapter 15 is then covered in greater 
detail in the chapters that follow. We emphasize the importance of good statistical 
design of scientific studies, and make the point that proper design often leads to a 
simple analysis. We note that the statistical analysis techniques used for observational 
and experimental studies are often the same, but the ability to “prove” cause-and-effect 
requires a carefully designed experimental study. 

6. Previously, the planning of sample sizes was covered in Chapter 26. We now present 
material on planning of sample sizes in the relevant chapter, rather than devoting a 
single, general discussion to this issue. 

7. We have expanded and updated our coverage (Section 24.2) on the interpretation of
interaction plots for multi-factor studies. 

8. We have reorganized and expanded the material on repeated measures designs in
Chapter 27. In particular, we introduce methods for handling the analysis of factor effects
when interactions between subjects and treatments are important, and when interactions 
between factors are important. 

9. We have added material on the design and analysis of balanced incomplete block 
experiments in Section 28.1, including the planning of sample sizes. A new appendix 
(B.15) has been added that provides standard balanced incomplete block designs. 

10. We have added new material on robust product and process design experiments in 
Chapter 29, and illustrate its use with a case study from the automotive industry. These 
experiments are frequently used in industrial studies to identify product or process 
designs that exhibit low levels of variation. 

The remaining changes pertain to both regression analysis (Parts I-III) and the design and 
analysis of experimental and observational studies (Parts IV-VI): 

11. We have made extensive revisions to the problem material. Problem data sets are 
generally larger and more challenging, and we have included a large number of new 
case data sets in Appendix C. In addition, we have added a new category of chapter 
exercises, called Case Studies. These are open-ended problems that require students,
given an overall objective, to carry out complete analyses of the various case data sets in 
Appendix C. They are distinct from the material in the Problems and Projects sections, 
which frequently ask students to simply carry out specific analytical procedures. 

12. We have substantially expanded the amount of graphic presentation, including much 
greater use of scatter plot matrices, three-dimensional rotating plots, three-dimensional 
response surface and contour plots, conditional effects plots, and main effects and 
interaction plots. 

13. Throughout the text, we have made extensive revisions in the exposition on the basis 
of classroom experience to improve the clarity of the presentation. 

We have included in this book not only the more conventional topics in regression and 
design, but also topics that are frequently slighted, though important in practice. We devote 
three chapters (Chapters 9-11) to the model-building process for regression, including 
computer-assisted selection procedures for identifying good subsets of predictor variables 



The Student Solutions Manual and all of the data files on the compact disk can also be 
downloaded from the book’s website at: www.mhhe.com/kutnerALSM5e. A list of errata 
for the book as well as some useful, related links will also be maintained at this address. 

A book such as this cannot be written without substantial assistance from numerous 
persons. We are indebted to the many contributors who have developed the theory and 
practice discussed in this book. We also would like to acknowledge appreciation to our students,
who helped us in a variety of ways to fashion the method of presentation contained
herein. We are grateful to the many users of Applied Linear Statistical Models and Applied 
Linear Regression Models, who have provided us with comments and suggestions based 
on their teaching with these texts. We are also indebted to Professors James E. Holstein, 
University of Missouri, and David L. Sherry, University of West Florida, for their review of 
Applied Linear Statistical Models, First Edition; to Professors Samuel Kotz, University of 
Maryland at College Park, Ralph P. Russo, University of Iowa, and Peter F. Thall, The George
Washington University, for their review of Applied Linear Regression Models, First Edition;
to Professors John S. Y. Chiu, University of Washington, James A. Calvin, University of
Iowa, and Michael F. Driscoll, Arizona State University, for their review of Applied Linear 
Statistical Models, Second Edition; to Professor Richard Anderson-Sprecher, University 
of Wyoming, for his review of Applied Linear Regression Models, Second Edition; and to 
Professors Alexander von Eye, The Pennsylvania State University, Samuel Kotz, University 
of Maryland at College Park, and John B. Willett, Harvard University, for their review of 
Applied Linear Statistical Models, Third Edition; to Professors Jason Abrevaya, University
of Chicago, Frank Alt, University of Maryland, Victoria Chen, Georgia Tech, Rebecca
Doerge, Purdue University, Mark Henry, Clemson University, Jim Hobert, University of
Florida, Ken Koehler, Iowa State University, Chii-Dean Lin, University of Massachusetts
Amherst, Mark Reiser, Arizona State University, Lawrence Ries, University of Missouri
Columbia, and Ehsan Soofi, University of Wisconsin Milwaukee, for their reviews of 
Applied Linear Regression Models, Third Edition, or Applied Linear Statistical Models, 
Fourth Edition. These reviews provided many important suggestions, for which we are 
most grateful. 

In addition, valuable assistance was provided by Professors Richard K. Burdick, 
Arizona State University, R. Dennis Cook, University of Minnesota, W. J. Conover, Texas 
Tech University, Mark E. Johnson, University of Central Florida, Dick DeVeaux, Williams 
College, and by Drs. Richard I. Beckman, Los Alamos National Laboratory, Ronald L. 
Iman, Sandia National Laboratories, Lexin Li, University of California Davis, and Brad 
Jones, SAS Institute. We are most appreciative of their willing help. We are also indebted 
to the 88 participants in a survey concerning Applied Linear Regression Models, Second 
Edition, the 76 participants in a survey concerning Applied Linear Statistical Models, Third 
Edition, and the 73 participants in a survey concerning Applied Linear Regression Models, 
Third Edition, or Applied Linear Statistical Models, Fourth Edition. Helpful suggestions 
were received in these surveys, for which we are thankful. 

Weiyong Zhang and Vincent Agboto assisted us diligently in the development of new
problem material, and Lexin Li and Yingwen Dong helped prepare the revised Instructor 
Solutions Manual and Student Solutions Manual under considerable time pressure. Amy 
Hendrickson provided much-needed LaTeX expertise. George Cotsonis assisted us diligently
in preparing computer-generated plots and in checking analysis results. We are most
grateful to these persons for their invaluable help and assistance. We also wish to thank
the various members of the Carlson Executive MBA Program classes of 2003 and 2004; 
notably Mike Ohmes, Trevor Bynum, Baxter Stephenson, Zakir Salyani, Sanders Marvin, 
Trent Spurgeon, Nate Ogzawalla, David Mott, Preston McKenzie, Bruce DeJong, and Tim 
Kensok, for their contributions of interesting and relevant case study data and materials. 

Finally, our families bore patiently the pressures caused by our commitment to complete 
this revision. We are appreciative of their understanding. 

Michael H. Kutner 
Christopher J. Nachtsheim 
John Neter 
William Li



Contents 


PART ONE 

SIMPLE LINEAR REGRESSION 1 

Chapter 1 

Linear Regression with One Predictor 
Variable 2 

1.1 Relations between Variables 2 

Functional Relation between Two 
Variables 2 

Statistical Relation between Two Variables 3 

1.2 Regression Models and Their Uses 5 

Historical Origins 5 
Basic Concepts 5 

Construction of Regression Models 7 
Uses of Regression Analysis 8 
Regression and Causality 8 
Use of Computers 9 

1.3 Simple Linear Regression Model 
with Distribution of Error Terms 
Unspecified 9 

Formal Statement of Model 9 
Important Features of Model 9 
Meaning of Regression Parameters 11 
Alternative Versions of Regression Model 12 

1.4 Data for Regression Analysis 12 

Observational Data 12 
Experimental Data 13 
Completely Randomized Design 13 

1.5 Overview of Steps in Regression 
Analysis 13 

1.6 Estimation of Regression Function 15 

Method of Least Squares 15 

Point Estimation of Mean Response 21 

Residuals 22 

Properties of Fitted Regression Line 23 

1.7 Estimation of Error Terms Variance $\sigma^2$ 24

Point Estimator of $\sigma^2$ 24

1.8 Normal Error Regression Model 26 

Model 26 

Estimation of Parameters by Method 
of Maximum Likelihood 27 


Cited References 33 
Problems 33 
Exercises 37 
Projects 38 

Chapter 2 

Inferences in Regression and Correlation 
Analysis 40 

2.1 Inferences Concerning $\beta_1$ 40

Sampling Distribution of $b_1$ 41
Sampling Distribution of $(b_1 - \beta_1)/s\{b_1\}$ 44
Confidence Interval for $\beta_1$ 45
Tests Concerning $\beta_1$ 47

2.2 Inferences Concerning $\beta_0$ 48

Sampling Distribution of $b_0$ 48
Sampling Distribution of $(b_0 - \beta_0)/s\{b_0\}$ 49
Confidence Interval for $\beta_0$ 49

2.3 Some Considerations on Making Inferences
Concerning $\beta_0$ and $\beta_1$ 50

Effects of Departures from Normality 50 
Interpretation of Confidence Coefficient 
and Risks of Errors 50 
Spacing of the X Levels 50 
Power of Tests 50 

2.4 Interval Estimation of $E\{Y_h\}$ 52

Sampling Distribution of $\hat{Y}_h$ 52
Sampling Distribution of
$(\hat{Y}_h - E\{Y_h\})/s\{\hat{Y}_h\}$ 54
Confidence Interval for $E\{Y_h\}$ 54

2.5 Prediction of New Observation 55

Prediction Interval for $Y_{h(\text{new})}$ when
Parameters Known 56
Prediction Interval for $Y_{h(\text{new})}$ when
Parameters Unknown 57
Prediction of Mean of m New Observations
for Given $X_h$ 60

2.6 Confidence Band for Regression Line 61 

2.7 Analysis of Variance Approach 
to Regression Analysis 63 

Partitioning of Total Sum of Squares 63 
Breakdown of Degrees of Freedom 66 






Mean Squares 66 
Analysis of Variance Table 67 
Expected Mean Squares 68 
F Test of $\beta_1 = 0$ versus $\beta_1 \neq 0$ 69

2.8 General Linear Test Approach 72 

Full Model 72 
Reduced Model 72 
Test Statistic 73 
Summary 73 

2.9 Descriptive Measures of Linear Association 
between X and Y 74 

Coefficient of Determination 74 
Limitations of $R^2$ 75
Coefficient of Correlation 76 

2.10 Considerations in Applying Regression 
Analysis 77 

2.11 Normal Correlation Models 78

Distinction between Regression and 
Correlation Model 78 
Bivariate Normal Distribution 78 
Conditional Inferences 80 
Inferences on Correlation Coefficients 83 
Spearman Rank Correlation Coefficient 87 
Cited References 89 
Problems 89 
Exercises 97 
Projects 98 

Chapter 3 

Diagnostics and Remedial Measures 100 

3.1 Diagnostics for Predictor Variable 100 

3.2 Residuals 102 

Properties of Residuals 102 
Semistudentized Residuals 103 
Departures from Model to Be Studied by 
Residuals 103 

3.3 Diagnostics for Residuals 103 

Nonlinearity of Regression Function 104 
Nonconstancy of Error Variance 107
Presence of Outliers 108 
Nonindependence of Error Terms 108 
Nonnormality of Error Terms 110 
Omission of Important Predictor 
Variables 112 
Some Final Comments 114 


3.4 Overview of Tests Involving 
Residuals 114 

Tests for Randomness 114 
Tests for Constancy of Variance 115 
Tests for Outliers 115 
Tests for Normality 115 

3.5 Correlation Test for Normality 115 

3.6 Tests for Constancy of Error 
Variance 116 

Brown-Forsythe Test 116 
Breusch-Pagan Test 118 

3.7 F Test for Lack of Fit 119 

Assumptions 119 
Notation 121 
Full Model 121 
Reduced Model 123 
Test Statistic 123 
ANOVA Table 124 

3.8 Overview of Remedial Measures 127 

Nonlinearity of Regression 
Function 128 

Nonconstancy of Error Variance 128 
Nonindependence of Error Terms 128 
Nonnormality of Error Terms 128
Omission of Important Predictor 
Variables 129 
Outlying Observations 129 

3.9 Transformations 129 

Transformations for Nonlinear 
Relation Only 129 
Transformations for Nonnormality 
and Unequal Error Variances 132 
Box-Cox Transformations 134 

3.10 Exploration of Shape of Regression 
Function 137 

Lowess Method 138 

Use of Smoothed Curves to Confirm Fitted 

Regression Function 139 

3.11 Case Example ― Plutonium 
Measurement 141 
Cited References 146 
Problems 146
Exercises 151
Projects 152
Case Studies 153





Chapter 4 

Simultaneous Inferences and Other 
Topics in Regression Analysis 154 

4.1 Joint Estimation of $\beta_0$ and $\beta_1$ 154

Need for Joint Estimation 154 
Bonferroni Joint Confidence Intervals 155 

4.2 Simultaneous Estimation of Mean 
Responses 157 

Working-Hotelling Procedure 158 
Bonferroni Procedure 159 

4.3 Simultaneous Prediction Intervals 
for New Observations 160 

4.4 Regression through Origin 161 

Model 161 
Inferences 161 

Important Cautions for Using Regression 
through Origin 164 

4.5 Effects of Measurement Errors 165 

Measurement Errors in Y 165 
Measurement Errors in X 165 
Berkson Model 167 

4.6 Inverse Predictions 168 

4.7 Choice of X Levels 170 
Cited References 172 
Problems 172 
Exercises 175 
Projects 175 

Chapter 5 

Matrix Approach to Simple 
Linear Regression Analysis 176 

5.1 Matrices 176 

Definition of Matrix 176 
Square Matrix 178 
Vector 178 
Transpose 178 
Equality of Matrices 179 

5.2 Matrix Addition and Subtraction 180 

5.3 Matrix Multiplication 182 

Multiplication of a Matrix by a Scalar 182 
Multiplication of a Matrix by a Matrix 182 

5.4 Special Types of Matrices 185 

Symmetric Matrix 185 
Diagonal Matrix 185 


Vector and Matrix with All Elements 
Unity 187 
Zero Vector 187 

5.5 Linear Dependence and Rank 
of Matrix 188 

Linear Dependence 188 
Rank of Matrix 188 

5.6 Inverse of a Matrix 189 

Finding the Inverse 190 
Uses of Inverse Matrix 192 

5.7 Some Basic Results for Matrices 193 

5.8 Random Vectors and Matrices 193 

Expectation of Random Vector or Matrix 
Variance-Covariance Matrix
of Random Vector 194 
Some Basic Results 196 
Multivariate Normal Distribution 196 

5.9 Simple Linear Regression Model 
in Matrix Terms 197 

5.10 Least Squares Estimation 

of Regression Parameters 199 
Normal Equations 199 
Estimated Regression Coefficients 200 

5.11 Fitted Values and Residuals 202 

Fitted Values 202 
Residuals 203 

5.12 Analysis of Variance Results 204 

Sums of Squares 204 
Sums of Squares as Quadratic 
Forms 205 

5.13 Inferences in Regression Analysis 206 

Regression Coefficients 207 
Mean Response 208 
Prediction of New Observation 209 
Cited Reference 209 
Problems 209 
Exercises 212 

PART TWO 
MULTIPLE LINEAR 
REGRESSION 213 

Chapter 6 

Multiple Regression I 214 

6.1 Multiple Regression Models 214 





Need for Several Predictor Variables 214 
First-Order Model with Two Predictor 
Variables 215 

First-Order Model with More than Two 

Predictor Variables 217 

General Linear Regression Model 217 

6.2 General Linear Regression Model in Matrix 
Terms 222 

6.3 Estimation of Regression Coefficients 223 

6.4 Fitted Values and Residuals 224 

6.5 Analysis of Variance Results 225 

Sums of Squares and Mean Squares 225 
F Test for Regression Relation 226 
Coefficient of Multiple Determination 226 
Coefficient of Multiple Correlation 227 

6.6 Inferences about Regression 
Parameters 227 

Interval Estimation of $\beta_k$ 228
Tests for $\beta_k$ 228
Joint Inferences 228 

6.7 Estimation of Mean Response and 
Prediction of New Observation 229 

Interval Estimation of $E\{Y_h\}$ 229
Confidence Region for Regression 
Surface 229 

Simultaneous Confidence Intervals for Several 
Mean Responses 230 
Prediction of New Observation $Y_{h(\text{new})}$ 230
Prediction of Mean of m New Observations
at $X_h$ 230

Predictions of g New Observations 231 
Caution about Hidden Extrapolations 231 

6.8 Diagnostics and Remedial Measures 232 

Scatter Plot Matrix 232 
Three-Dimensional Scatter Plots 233 
Residual Plots 233 
Correlation Test for Normality 234 
Brown-Forsythe Test for Constancy of Error 
Variance 234

Breusch-Pagan Test for Constancy of Error 
Variance 234 
F Test for Lack of Fit 235 
Remedial Measures 236 

6.9 An Example — Multiple Regression with 
Two Predictor Variables 236 

Setting 236 


Basic Calculations 237 
Estimated Regression Function 240 
Fitted Values and Residuals 241 
Analysis of Appropriateness of Model 241 
Analysis of Variance 243 
Estimation of Regression Parameters 245 
Estimation of Mean Response 245 
Prediction Limits for New Observations 247 
Cited Reference 248 
Problems 248 
Exercises 253 
Projects 254 

Chapter 7 

Multiple Regression II 256 

7.1 Extra Sums of Squares 256 

Basic Ideas 256 
Definitions 259 

Decomposition of SSR into Extra Sums
of Squares 260

ANOVA Table Containing Decomposition
of SSR 261

7.2 Uses of Extra Sums of Squares in Tests for 
Regression Coefficients 263 

Test whether a Single $\beta_k = 0$ 263
Test whether Several $\beta_k = 0$ 264

7.3 Summary of Tests Concerning Regression
Coefficients 266

Test whether All $\beta_k = 0$ 266
Test whether a Single $\beta_k = 0$ 267
Test whether Some $\beta_k = 0$ 267
Other Tests 268 

7.4 Coefficients of Partial Determination 268 

Two Predictor Variables 269 
General Case 269 

Coefficients of Partial Correlation 270 

7.5 Standardized Multiple Regression 
Model 271 

Roundoff Errors in Normal Equations 

Calculations 271 

Lack of Comparability in Regression 

Coefficients 272

Correlation Transformation 272 

Standardized Regression Model 273 

X'X Matrix for Transformed Variables 274 





Estimated Standardized Regression 
Coefficients 275 

7.6 Multicollinearity and Its Effects 278 
Uncorrelated Predictor Variables 279 
Nature of Problem when Predictor Variables 
Are Perfectly Correlated 281 
Effects of Multicollinearity 283 
Need for More Powerful Diagnostics for
Multicollinearity 289 
Cited Reference 289 
Problems 289 
Exercise 292 
Projects 293 

Chapter 8 

Regression Models for Quantitative 
and Qualitative Predictors 294 

8.1 Polynomial Regression Models 294 

Uses of Polynomial Models 294 
One Predictor Variable - Second Order 295
One Predictor Variable - Third Order 296
One Predictor Variable - Higher Orders 296
Two Predictor Variables - Second Order 297
Three Predictor Variables - Second
Order 298

Implementation of Polynomial Regression 

Models 298 

Case Example 300 

Some Further Comments on Polynomial 

Regression 305 

8.2 Interaction Regression Models 306 

Interaction Effects 306 
Interpretation of Interaction Regression 
Models with Linear Effects 306 
Interpretation of Interaction Regression 
Models with Curvilinear Effects 309 
Implementation of Interaction Regression 
Models 311 

8.3 Qualitative Predictors 313 

Qualitative Predictor with Two 
Classes 314 

Interpretation of Regression Coefficients 315 
Qualitative Predictor with More than Two 
Classes 318 

Time Series Applications 319 


8.4 Some Considerations in Using Indicator 
Variables 321 

Indicator Variables versus Allocated 
Codes 321 

Indicator Variables versus Quantitative 
Variables 322 

Other Codings for Indicator Variables 323 

8.5 Modeling Interactions between Quantitative 
and Qualitative Predictors 324 

Meaning of Regression Coefficients 324 

8.6 More Complex Models 327 

More than One Qualitative Predictor 
Variable 328 

Qualitative Predictor Variables Only 329 

8.7 Comparison of Two or More Regression 
Functions 329 

Soap Production Lines Example 330 
Instrument Calibration Study Example 334 
Cited Reference 335 
Problems 335 
Exercises 340 
Projects 341 
Case Study 342 

Chapter 9 

Building the Regression Model I: 

Model Selection and Validation 343 

9.1 Overview of Model-Building Process 343 

Data Collection 343 
Data Preparation 346 
Preliminary Model Investigation 346 
Reduction of Explanatory Variables 347 
Model Refinement and Selection 349 
Model Validation 350 

9.2 Surgical Unit Example 350 

9.3 Criteria for Model Selection 353
$R_p^2$ or $SSE_p$ Criterion 354
$R_{a,p}^2$ or $MSE_p$ Criterion 355
Mallows' $C_p$ Criterion 357
$AIC_p$ and $SBC_p$ Criteria 359
$PRESS_p$ Criterion 360

9.4 Automatic Search Procedures for Model 
Selection 361 

“Best” Subsets Algorithm 361 
Stepwise Regression Methods 364 





Forward Stepwise Regression 364 
Other Stepwise Procedures 367 

9.5 Some Final Comments on Automatic 
Model Selection Procedures 368 

9.6 Model Validation 369 

Collection of New Data to Check 
Model 370 

Comparison with Theory, Empirical 
Evidence, or Simulation Results 371 
Data Splitting 372 
Cited References 375 
Problems 376 
Exercise 380 
Projects 381 
Case Studies 382 

Chapter 10 

Building the Regression Model II: 
Diagnostics 384 

10.1 Model Adequacy for a Predictor 
Variable — Added-Variable Plots 384 

10.2 Identifying Outlying Y Observations — 
Studentized Deleted Residuals 390 

Outlying Cases 390 
Residuals and Semistudentized 
Residuals 392 
Hat Matrix 392 
Studentized Residuals 394 
Deleted Residuals 395 
Studentized Deleted Residuals 396 

10.3 Identifying Outlying X Observations ― Hat 
Matrix Leverage Values 398 

Use of Hat Matrix for Identifying Outlying 
X Observations 398 
Use of Hat Matrix to Identify Hidden 
Extrapolation 400 

10.4 Identifying Influential Cases ― DFFITS, 

Cook’s Distance, and DFBETAS 
Measures 400

Influence on Single Fitted 

Value—DFFITS 401 

Influence on All Fitted Values — Cook's 

Distance 402 

Influence on the Regression 

Coefficients—DFBETAS 404 


Influence on Inferences 405 
Some Final Comments 406 

10.5 Multicollinearity Diagnostics — Variance 
Inflation Factor 406 

Informal Diagnostics 407 
Variance Inflation Factor 408 

10.6 Surgical Unit Example — Continued 410 
Cited References 414 

Problems 414 
Exercises 419 
Projects 419 
Case Studies 420 

Chapter 11 

Building the Regression Model III: 
Remedial Measures 421 

11.1 Unequal Error Variances Remedial 
Measures — Weighted Least Squares 421 

Error Variances Known 422 
Error Variances Known up to 
Proportionality Constant 424 
Error Variances Unknown 424

11.2 Multicollinearity Remedial 
Measures — Ridge Regression 431 

Some Remedial Measures 431 
Ridge Regression 432 

11.3 Remedial Measures for Influential 
Cases — Robust Regression 437 

Robust Regression 438 
IRLS Robust Regression 439 

11.4 Nonparametric Regression: Lowess 
Method and Regression Trees 449 

Lowess Method 449 
Regression Trees 453 

11.5 Remedial Measures for Evaluating 
Precision in Nonstandard 
Situations ― Bootstrapping 458 

General Procedure 459 
Bootstrap Sampling 459 
Bootstrap Confidence Intervals 460 

11.6 Case Example — MNDOT Traffic 
Estimation 464

The AADT Database 464

Model Development 465 

Weighted Least Squares Estimation 468 





Cited References 471 
Problems 472 
Exercises 476 
Projects 476 
Case Studies 480 

Chapter 12 
Autocorrelation in Time 
Series Data 481 

12.1 Problems of Autocorrelation 481 

12.2 First-Order Autoregressive Error
Model 484 

Simple Linear Regression 484 
Multiple Regression 484 
Properties of Error Terms 485 

12.3 Durbin-Watson Test for
Autocorrelation 487 

12.4 Remedial Measures for
Autocorrelation 490 

Addition of Predictor Variables 490 
Use of Transformed Variables 490 
Cochrane-Orcutt Procedure 492 
Hildreth-Lu Procedure 495 
First Differences Procedure 496 
Comparison of Three Methods 498 

12.5 Forecasting with Autocorrelated Error
Terms 499 

Cited References 502 
Problems 502 
Exercises 507 
Projects 508 
Case Studies 508 

PART THREE 

NONLINEAR REGRESSION 509 

Chapter 13 

Introduction to Nonlinear Regression
and Neural Networks 510 

13.1 Linear and Nonlinear Regression 
Models 510 

Linear Regression Models 510 
Nonlinear Regression Models 511 
Estimation of Regression Parameters 514 


13.2 Least Squares Estimation in Nonlinear 
Regression 515 

Solution of Normal Equations 517 
Direct Numerical Search — Gauss-Newton 
Method 518 

Other Direct Search Procedures 525 

13.3 Model Building and Diagnostics 526 

13.4 Inferences about Nonlinear Regression 

Parameters 527

Estimate of Error Term Variance 527
Large-Sample Theory 528

When Is Large-Sample Theory 
Applicable? 528 

Interval Estimation of a Single $\gamma_k$ 531
Simultaneous Interval Estimation
of Several $\gamma_k$ 532

Test Concerning a Single $\gamma_k$ 532

Test Concerning Several $\gamma_k$ 533

13.5 Learning Curve Example 533 

13.6 Introduction to Neural Network 
Modeling 537 

Neural Network Model 537 
Network Representation 540 
Neural Network as Generalization of Linear 
Regression 541 

Parameter Estimation: Penalized Least 
Squares 542 

Example: Ischemic Heart Disease 543 
Model Interpretation and 
Prediction 546 

Some Final Comments on Neural Network 
Modeling 547 
Cited References 547 
Problems 548 
Exercises 552 
Projects 552 
Case Studies 554 

Chapter 14 

Logistic Regression, Poisson Regression, 
and Generalized Linear Models 555 

14.1 Regression Models with Binary Response 
Variable 555 

Meaning of Response Function when 
Outcome Variable Is Binary 556 





Special Problems when Response Variable 
Is Binary 557 

14.2 Sigmoidal Response Functions 
for Binary Responses 559 

Probit Mean Response Function 559 
Logistic Mean Response Function 560 
Complementary Log-Log Response 
Function 562 

14.3 Simple Logistic Regression 563 

Simple Logistic Regression Model 563 
Likelihood Function 564 
Maximum Likelihood Estimation 564 
Interpretation of $b_1$ 567
Use of Probit and Complementary Log-Log 
Response Functions 568 
Repeat Observations — Binomial 
Outcomes 568 

14.4 Multiple Logistic Regression 570 

Multiple Logistic Regression Model 570 

Fitting of Model 571 

Polynomial Logistic Regression 575 

14.5 Inferences about Regression 
Parameters 577 

Test Concerning a Single $\beta_k$: Wald
Test 578

Interval Estimation of a Single $\beta_k$ 579
Test whether Several $\beta_k = 0$: Likelihood
Ratio Test 580

14.6 Automatic Model Selection 
Methods 582 

Model Selection Criteria 582 
Best Subsets Procedures 583 
Stepwise Model Selection 583 

14.7 Tests for Goodness of Fit 586 

Pearson Chi-Square Goodness 
of Fit Test 586 

Deviance Goodness of Fit Test 588 
Hosmer-Lemeshow Goodness 
of Fit Test 589 

14.8 Logistic Regression Diagnostics 591

Logistic Regression Residuals 591 
Diagnostic Residual Plots 594 
Detection of Influential 
Observations 598 

14.9 Inferences about 
Mean Response 602 


Point Estimator 602 
Interval Estimation 602 
Simultaneous Confidence Intervals for 
Several Mean Responses 603 

14.10 Prediction of a New Observation 604 

Choice of Prediction Rule 604 
Validation of Prediction Error Rate 607 

14.11 Polytomous Logistic Regression for 
Nominal Response 608 

Pregnancy Duration Data 
with Polytomous Response 609 
J — 1 Baseline-Category Logits for 
Nominal Response 610 
Maximum Likelihood Estimation 612 

14.12 Polytomous Logistic Regression 
for Ordinal Response 614 

14.13 Poisson Regression 618 

Poisson Distribution 618 
Poisson Regression Model 619 
Maximum Likelihood Estimation 620 
Model Development 620 
Inferences 621 

14.14 Generalized Linear Models 623 
Cited References 624 
Problems 625 

Exercises 634 
Projects 635 
Case Studies 640 

PART FOUR 

DESIGN AND ANALYSIS OF 
SINGLE-FACTOR STUDIES 641 

Chapter 15 

Introduction to the Design of 
Experimental and Observational 
Studies 642 

15.1 Experimental Studies, Observational 
Studies, and Causation 643 

Experimental Studies 643 
Observational Studies 644 
Mixed Experimental and Observational 
Studies 646 

15.2 Experimental Studies: Basic 
Concepts 647 





Factors 647 

Crossed and Nested Factors 648 
Treatments 649 
Choice of Treatments 649 
Experimental Units 652 
Sample Size and Replication 652 
Randomization 653 
Constrained Randomization: 

Blocking 655 
Measurements 658 

15.3 An Overview of Standard Experimental 
Designs 658 

Completely Randomized Design 659 
Factorial Experiments 660 
Randomized Complete Block 
Designs 661 
Nested Designs 662 
Repeated Measures Designs 663 
Incomplete Block Designs 664 
Two-Level Factorial and Fractional 
Factorial Experiments 665 
Response Surface Experiments 666 

15.4 Design of Observational Studies 666 

Cross-Sectional Studies 666 
Prospective Studies 667 
Retrospective Studies 667 
Matching 668 

15.5 Case Study: Paired-Comparison 
Experiment 669 

15.6 Concluding Remarks 672 
Cited References 672 
Problems 672 
Exercise 676 

Chapter 16 

Single-Factor Studies 677 

16.1 Single-Factor Experimental and 
Observational Studies 677 

16.2 Relation between Regression and 
Analysis of Variance 679 

Illustrations 679 

Choice between Two Types of Models 680 

16.3 Single-Factor ANOVA Model 681 

Basic Ideas 681 


Cell Means Model 681 
Important Features of Model 682 
The ANOVA Model Is a Linear 
Model 683 

Interpretation of Factor Level Means 
Distinction between ANOVA Models I
and II 685 

16.4 Fitting of ANOVA Model 685 

Notation 686 

Least Squares and Maximum Likelihood 
Estimators 687 
Residuals 689 

16.5 Analysis of Variance 690 

Partitioning of SSTO 690 
Breakdown of Degrees of Freedom 693 
Mean Squares 693 
Analysis of Variance Table 694 
Expected Mean Squares 694 

16.6 F Test for Equality of Factor Level 
Means 698 

Test Statistic 698 
Distribution of F* 699 
Construction of Decision Rule 699 

16.7 Alternative Formulation of Model 701 

Factor Effects Model 701 
Definition of $\mu_{\cdot}$ 702
Test for Equality of Factor Level 
Means 704 

16.8 Regression Approach to Single-Factor 
Analysis of Variance 704 

Factor Effects Model with Unweighted 
Mean 705 

Factor Effects Model with Weighted 
Mean 709 

Cell Means Model 710 

16.9 Randomization Tests 712 

16.10 Planning of Sample Sizes with Power 
Approach 716 

Power of F Test 716 

Use of Table B.12 for Single-Factor
Studies 718

Some Further Observations on Use
of Table B.12 720

16.11 Planning of Sample Sizes to Find “Best” 
Treatment 721 

Cited Reference 722 





Problems 722 
Exercises 730 
Projects 730 
Case Studies 732 

Chapter 17 

Analysis of Factor Level Means 733 

17.1 Introduction 733 

17.2 Plots of Estimated Factor Level 
Means 735 

Line Plot 735 

Bar Graph and Main Effects Plot 736 

17.3 Estimation and Testing of Factor Level 
Means 737 

Inferences for Single Factor Level 
Mean 737 

Inferences for Difference between Two 
Factor Level Means 739 
Inferences for Contrast of Factor Level 
Means 741 

Inferences for Linear Combination of 
Factor Level Means 743 

17.4 Need for Simultaneous Inference 
Procedures 744 

17.5 Tukey Multiple Comparison 
Procedure 746 

Studentized Range Distribution 746 
Simultaneous Estimation 747 
Simultaneous Testing 747 
Example 1—Equal Sample Sizes 748 
Example 2 — Unequal Sample Sizes 750 

17.6 Scheffe Multiple Comparison 
Procedure 753 

Simultaneous Estimation 753 
Simultaneous Testing 754 
Comparison of Scheffe and Tukey 
Procedures 755 

17.7 Bonferroni Multiple Comparison 

Procedure 756

Simultaneous Estimation 756 
Simultaneous Testing 756 
Comparison of Bonferroni Procedure with 
Scheffe and Tukey Procedures 757 
Analysis of Means 758 


17.8 Planning of Sample Sizes with Estimation 
Approach 759 

Example 1—Equal Sample Sizes 759 
Example 2 — Unequal Sample Sizes 761 

17.9 Analysis of Factor Effects when Factor 
Is Quantitative 762 

Cited References 766 
Problems 767 
Exercises 773 
Projects 774 
Case Studies 774 

Chapter 18 

ANOVA Diagnostics and Remedial 
Measures 775

18.1 Residual Analysis 775 

Residuals 776 

Residual Plots 776 

Diagnosis of Departures from ANOVA 

Model 778 

18.2 Tests for Constancy of Error 
Variance 781 

Hartley Test 782 
Brown-Forsythe Test 784 

18.3 Overview of Remedial Measures 786 

18.4 Weighted Least Squares 786 

18.5 Transformations of Response 
Variable 789 

Simple Guides to Finding a 
Transformation 789 
Box-Cox Procedure 791 

18.6 Effects of Departures from Model 793 

Nonnormality 793 

Unequal Error Variances 794 

Nonindependence of Error Terms 794 

18.7 Nonparametric Rank F Test 795

Test Procedure 795 
Multiple Pairwise Testing 
Procedure 797 

18.8 Case Example — Heart Transplant 798 
Cited References 801 

Problems 801
Exercises 807 
Projects 807 
Case Studies 809 





PART FIVE 

MULTI-FACTOR STUDIES 811 

Chapter 19 

Two-Factor Studies with Equal 
Sample Sizes 812 

19.1 Two-Factor Observational and 
Experimental Studies 812 

Examples of Two-Factor Experiments and 
Observational Studies 812 
The One-Factor-at-a-Time (OFAAT) 
Approach to Experimentation 815 
Advantages of Crossed, Multi-Factor 
Designs 816 

19.2 Meaning of ANOVA Model 
Elements 817 

Illustration 817 
Treatment Means 817 
Factor Level Means 818 
Main Effects 818 
Additive Factor Effects 819 
Interacting Factor Effects 822 
Important and Unimportant 
Interactions 824 

Transformable and Nontransformable 

Interactions 826 

Interpretation of Interactions 827 

19.3 Model I (Fixed Factor Levels) for 
Two-Factor Studies 829 

Cell Means Model 830 
Factor Effects Model 831 
19.4 Analysis of Variance 833

Illustration 833 
Notation 834 

Fitting of ANOVA Model 834 
Partitioning of Total Sum 
of Squares 836 

Partitioning of Degrees of Freedom 839 
Mean Squares 839 
Expected Mean Squares 840 
Analysis of Variance Table 840 
19.5 Evaluation of Appropriateness of 
ANOVA Model 842 


19.6 F Tests 843 

Test for Interactions 844 
Test for Factor A Main Effects 844 
Test for Factor B Main Effects 845 
Kimball Inequality 846 

19.7 Strategy for Analysis 847 

19.8 Analysis of Factor Effects when Factors 
Do Not Interact 848 

Estimation of Factor Level Mean 848 
Estimation of Contrast of Factor Level 
Means 849 

Estimation of Linear Combination of 

Factor Level Means 850 

Multiple Pairwise Comparisons of Factor 

Level Means 850 

Multiple Contrasts of Factor Level 

Means 852 

Estimates Based on Treatment 
Means 853 

Example 1 — Pairwise Comparisons
of Factor Level Means 853
Example 2 - Estimation of Treatment
Means 855 

19.9 Analysis of Factor Effects when 
Interactions Are Important 856 

Multiple Pairwise Comparisons 
of Treatment Means 856 
Multiple Contrasts of Treatment 
Means 857 

Example 1 — Pairwise Comparisons 
of Treatment Means 857 
Example 2 — Contrasts of Treatment 
Means 860 

19.10 Pooling Sums of Squares in Two-Factor 
Analysis of Variance 861 

19.11 Planning of Sample Sizes for Two-Factor 
Studies 862 

Power Approach 862 
Estimation Approach 863 
Finding the "Best" Treatment 864 
Problems 864 
Exercises 876 
Projects 876 
Case Studies 879 






Chapter 20 

Two-Factor Studies — One Case 
per Treatment 880 

20.1 No-Interaction Model 880 

Model 881 

Analysis of Variance 881 
Inference Procedures 881 
Estimation of Treatment Mean 884 

20.2 Tukey Test for Additivity 886 

Development of Test Statistic 886 
Remedial Actions if Interaction Effects 
Are Present 888 
Cited Reference 889 
Problems 889 
Exercises 891 
Case Study 891 

Chapter 21 

Randomized Complete Block 
Designs 892 

21.1 Elements of Randomized Complete Block 
Designs 892 

Description of Designs 892 
Criteria for Blocking 893 
Advantages and Disadvantages 894 
How to Randomize 895 
Illustration 895 

21.2 Model for Randomized Complete Block 
Designs 897 

21.3 Analysis of Variance and Tests 898 

Fitting of Randomized Complete 
Block Model 898 
Analysis of Variance 898 

21.4 Evaluation of Appropriateness 
of Randomized Complete Block 
Model 901 

Diagnostic Plots 901 

Tukey Test for Additivity 903

21.5 Analysis of Treatment Effects 904

21.6 Use of More than One Blocking 
Variable 905 

21.7 Use of More than One Replicate in Each 
Block 906 


21.8 Factorial Treatments 908 

21.9 Planning Randomized Complete Block 
Experiments 909 

Power Approach 909 
Estimation Approach 910 
Efficiency of Blocking Variable 911 
Problems 912 
Exercises 916 

Chapter 22 

Analysis of Covariance 917 

22.1 Basic Ideas 917 

How Covariance Analysis Reduces Error 
Variability 917

Concomitant Variables 919 

22.2 Single-Factor Covariance Model 920 

Notation 921 

Development of Covariance Model 921 
Properties of Covariance Model 922 
Generalizations of Covariance
Model 923 

Regression Formula of Covariance 
Model 924 

Appropriateness of Covariance 
Model 925 

Inferences of Interest 925 

22.3 Example of Single-Factor Covariance 
Analysis 926 

Development of Model 926 
Test for Treatment Effects 928 
Estimation of Treatment Effects 930 
Test for Parallel Slopes 932 

22.4 Two-Factor Covariance Analysis 933 

Covariance Model for Two-Factor 
Studies 933 

Regression Approach 934

Covariance Analysis for Randomized 
Complete Block Designs 937 

22.5 Additional Considerations for the Use 
of Covariance Analysis 939 

Covariance Analysis as Alternative 
to Blocking 939 
Use of Differences 939 
Correction for Bias 940 





Interest in Nature of Treatment 
Effects 940 
Problems 941 
Exercise 947 
Projects 947 
Case Studies 950 

Chapter 23 

Two-Factor Studies with Unequal 
Sample Sizes 951 

23.1 Unequal Sample Sizes 951 

Notation 952 

23.2 Use of Regression Approach for Testing 
Factor Effects when Sample Sizes Are 
Unequal 953 

Regression Approach to Two-Factor 
Analysis of Variance 953 

23.3 Inferences about Factor Effects when 
Sample Sizes Are Unequal 959 

Example 1 — Pairwise Comparisons 
of Factor Level Means 962 
Example 2 ― Single-Degree-of-Freedom 
Test 964 

23.4 Empty Cells in Two-Factor Studies 964 

Partial Analysis of Factor Effects 965 
Analysis if Model with No Interactions Can 
Be Employed 966 
Missing Observations in Randomized 
Complete Block Designs 967 

23.5 ANOVA Inferences when Treatment 
Means Are of Unequal Importance 970 

Estimation of Treatment Means and Factor 
Effects 971 

Test for Interactions 972 

Tests for Factor Main Effects by Use 

of Equivalent Regression Models 972 

Tests for Factor Main Effects by Use 

of Matrix Formulation 975 

Tests for Factor Effects when Weights Are 

Proportional to Sample Sizes 977 

23.6 Statistical Computing Packages 980 
Problems 981 

Exercises 988 
Projects 988 
Case Studies 990 


Chapter 24 

Multi-Factor Studies 992 

24.1 ANOVA Model for Three-Factor 
Studies 992 

Notation 992 
Illustration 993 
Main Effects 993 
Two-Factor Interactions 995 
Three-Factor Interactions 996 
Cell Means Model 996 
Factor Effects Model 997 

24.2 Interpretation of Interactions 
in Three-Factor Studies 998 

Learning Time Example 1: Interpretation 
of Three-Factor Interactions 998 
Learning Time Example 2: Interpretation 
of Multiple Two-Factor Interactions 999 
Learning Time Example 3: Interpretation 
of a Single Two-Factor Interaction 1000 

24.3 Fitting of ANOVA Model 1003 

Notation 1003 
Fitting of ANOVA Model 1003 
Evaluation of Appropriateness of ANOVA 
Model 1005 

24.4 Analysis of Variance 1008 

Partitioning of Total Sum of Squares 1008 
Degrees of Freedom and Mean 
Squares 1009 
Tests for Factor Effects 1009 

24.5 Analysis of Factor Effects 1013

Strategy for Analysis 1013 

Analysis of Factor Effects when Factors Do 

Not Interact 1014 

Analysis of Factor Effects with Multiple 
Two-Factor Interactions or Three-Factor 
Interaction 1016 

Analysis of Factor Effects with Single 
Two-Factor Interaction 1016 
Example - Estimation of Contrasts
of Treatment Means 1018 

24.6 Unequal Sample Sizes in Multi-Factor 
Studies 1019 

Tests for Factor Effects 1019 
Inferences for Contrasts of Factor Level 
Means 1020 





24.7 Planning of Sample Sizes 1021 
Power of F Test for Multi-Factor
Studies 1021

Use of Table B.12 for Multi-Factor
Studies 1021 
Cited Reference 1022 
Problems 1022 
Exercises 1027 
Projects 1027 
Case Studies 1028 

Chapter 25 

Random and Mixed Effects Models 1030 

25.1 Single-Factor Studies — ANOVA 
Model II 1031

Random Cell Means Model 1031 
Questions of Interest 1034 
Test whether $\sigma_\mu^2 = 0$ 1035
Estimation of $\mu_{\cdot}$ 1038
Estimation of $\sigma_\mu^2 / (\sigma_\mu^2 + \sigma^2)$ 1040
Estimation of $\sigma^2$ 1041
Point Estimation of $\sigma_\mu^2$ 1042
Interval Estimation of $\sigma_\mu^2$ 1042
Random Factor Effects Model 1047 

25.2 Two-Factor Studies — ANOVA Models II 
and III 1047

ANOVA Model II - Random Factor
Effects 1047

ANOVA Model III - Mixed Factor
Effects 1049 

25.3 Two-Factor Studies — ANOVA Tests for 
Models II and III 1052

Expected Mean Squares 1052 
Construction of Test Statistics 1053 

25.4 Two-Factor Studies — Estimation 
of Factor Effects for Models II 
and III 1055

Estimation of Variance Components 1055
Estimation of Fixed Effects in Mixed 
Model 1056 

25.5 Randomized Complete Block Design: 
Random Block Effects 1060 

Additive Model 1061 
Interaction Model 1064 


25.6 Three-Factor Studies — ANOVA 
Models II and III 1066 

ANOVA Model II - Random Factor
Effects 1066

ANOVA Model III - Mixed Factor
Effects 1066 

Appropriate Test Statistics 1067 
Estimation of Effects 1069 

25.7 ANOVA Models II and III with Unequal 
Sample Sizes 1070 

Maximum Likelihood Approach 1072 
Cited References 1077 
Problems 1077 
Exercises 1085 
Projects 1085 

PART SIX 

SPECIALIZED STUDY 
DESIGNS 1087 

Chapter 26 

Nested Designs, Subsampling, and
Partially Nested Designs 1088 

26.1 Distinction between Nested and Crossed 
Factors 1088 

26.2 Two-Factor Nested Designs 1091 

Development of Model Elements 1091 
Nested Design Model 1092 
Random Factor Effects 1093 

26.3 Analysis of Variance for Two-Factor

Nested Designs 1093 
Fitting of Model 1093 
Sums of Squares 1094 
Degrees of Freedom 1095 
Tests for Factor Effects 1097 
Random Factor Effects 1099 

26.4 Evaluation of Appropriateness of Nested 
Design Model 1099 

26.5 Analysis of Factor Effects in Two-Factor 
Nested Designs 1100 

Estimation of Factor Level Means $\mu_{i\cdot}$ 1100
Estimation of Treatment Means $\mu_{ij}$ 1102
Estimation of Overall Mean $\mu_{\cdot\cdot}$ 1103
Estimation of Variance Components 1103 





26.6 Unbalanced Nested Two-Factor 
Designs 1104 

26.7 Subsampling in Single-Factor Study with 
Completely Randomized Design 1106 

Model 1107 

Analysis of Variance and Tests of 
Effects 1108 

Estimation of Treatment Effects 1110 
Estimation of Variances 1111 

26.8 Pure Subsampling in Three Stages 1113 

Model 1113 

Analysis of Variance 1113 
Estimation of $\mu_{\cdot\cdot}$ 1113

26.9 Three-Factor Partially Nested 
Designs 1114 

Development of Model 1114 
Analysis of Variance 1115 
Cited Reference 1119 
Problems 1119 
Exercises 1125 
Projects 1125 

Chapter 27 

Repeated Measures and Related 
Designs 1127 

27.1 Elements of Repeated Measures 
Designs 1127 

Description of Designs 1127 
Advantages and Disadvantages 1128 
How to Randomize 1128 

27.2 Single-Factor Experiments with Repeated
Measures on All Treatments 1129 
Model 1129 

Analysis of Variance and Tests 1130 

Evaluation of Appropriateness of Repeated 

Measures Model 1134 

Analysis of Treatment Effects 1137 

Ranked Data 1138 

Multiple Pairwise Testing 

Procedure 1138 

27.3 Two-Factor Experiments with Repeated 
Measures on One Factor 1140 
Description of Design 1140 
Model 1141 

Analysis of Variance and Tests 1142 


Evaluation of Appropriateness of Repeated 

Measures Model 1144 

Analysis of Factor Effects: Without 

Interaction 1145 

Analysis of Factor Effects: With 

Interaction 1148 

Blocking of Subjects in Repeated Measures 
Designs 1153 

27.4 Two-Factor Experiments with Repeated 
Measures on Both Factors 1153 

Model 1154 

Analysis of Variance and Tests 1155 
Evaluation of Appropriateness of Repeated 
Measures Model 1157 
Analysis of Factor Effects 1157 

27.5 Regression Approach to Repeated 
Measures Designs 1161 

27.6 Split-Plot Designs 1162 
Cited References 1164 
Problems 1164 
Exercise 1171 
Projects 1171 

Chapter 28 

Balanced Incomplete Block, Latin Square,
and Related Designs 1173 

28.1 Balanced Incomplete Block 
Designs 1173 

Advantages and Disadvantages 
of BIBDs 1175

28.2 Analysis of Balanced Incomplete Block 
Designs 1177 

BIBD Model 1177 
Regression Approach to Analysis of 
Balanced Incomplete Block Designs 1177 
Analysis of Treatment Effects 1180 
Planning of Sample Sizes with Estimation 
Approach 1182 

28.3 Latin Square Designs 1183 

Basic Ideas 1183 
Description of Latin Square 
Designs 1184 

Advantages and Disadvantages of Latin 
Square Designs 1185 



Randomization of Latin Square
Design 1185 

28.4 Latin Square Model 1187 

28.5 Analysis of Latin Square 
Experiments 1188 

Notation 1188 

Fitting of Model 1188 

Analysis of Variance 1188 

Test for Treatment Effects 1190 

Analysis of Treatment Effects 1190 

Residual Analysis 1191 

Factorial Treatments 1192 

Random Blocking Variable Effects 1193 

Missing Observations 1193 

28.6 Planning Latin Square 
Experiments 1193 

Power of F Test 1193

Necessary Number of Replications 1193 

Efficiency of Blocking Variables 1193 

28.7 Additional Replications with Latin 
Square Designs 1195 

Replications within Cells 1195 
Additional Latin Squares 1196 

28.8 Replications in Repeated Measures 
Studies 1198 

Latin Square Crossover Designs 1198 
Use of Independent Latin Squares 1200 
Carryover Effects 1201 
Cited References 1202 
Problems 1202 

Chapter 29 

Exploratory Experiments: Two-Level 
Factorial and Fractional Factorial
Designs 1209 

29.1 Two-Level Full Factorial 
Experiments 1210 

Design of Two-Level Studies 1210 
Notation 1210 

Estimation of Factor Effects 1212 
Inferences about Factor Effects 1214 

29.2 Analysis of Unreplicated Two-Level 
Studies 1216 

Pooling of Interactions 1218 
Pareto Plot 1219 


Dot Plot 1220 

Normal Probability Plot 1221 

Center Point Replications 1222 

29.3 Two-Level Fractional Factorial 
Designs 1223 

Confounding 1224
Defining Relation 1227
Half-Fraction Designs 1228
Quarter-Fraction and Smaller-Fraction
Designs 1229 
Resolution 1231 
Selecting a Fraction of Highest 
Resolution 1232 

29.4 Screening Experiments 1239 

$2_{III}^{k-f}$ Fractional Factorial Designs 1239
Plackett-Burman Designs 1240 

29.5 Incomplete Block Designs for Two-Level 
Factorial Experiments 1240 

Assignment of Treatments to Blocks 1241 
Use of Center Point Replications 1243 

29.6 Robust Product and Process 
Design 1244 

Location and Dispersion Modeling 1246 
Incorporating Noise Factors 1250 
Case Study — Clutch Slave Cylinder 
Experiment 1252 
Cited References 1256 
Problems 1256 
Exercises 1266 

Chapter 30 

Response Surface Methodology 1267 

30.1 Response Surface Experiments 1267 

30.2 Central Composite Response Surface 
Designs 1268 

Structure of Central Composite
Designs 1268 

Commonly Used Central Composite 
Designs 1270 
Rotatable Central Composite 
Designs 1271 

Other Criteria for Choosing a Central 
Composite Design 1273 
Blocking Central Composite 
Designs 1275 





Additional General-Purpose Response 
Surface Designs 1276 
30.3 Optimal Response Surface
Designs 1276 

Purpose of Optimal Designs 1276 
Optimal Design Approach 1278 
Design Criteria for Optimal Design 
Selection 1279 

Construction of Optimal Response Surface 

Designs 1282 

Some Final Cautions 1283 

30.4 Analysis of Response Surface 
Experiments 1284 

Model Interpretation and 
Visualization 1284 
Response Surface Optimum 
Conditions 1286 

30.5 Sequential Search for Optimum 
Conditions — Method of Steepest 
Ascent 1290 

Cited References 1292 
Problems 1292 
Projects 1295 


Appendix A 

Some Basic Results in Probability
and Statistics 1297 

Appendix B 

Tables 1315 

Appendix C 

Data Sets 1348 

Appendix D 

Rules for Developing ANOVA 
Tables for Balanced Designs 

Appendix E 

Selected Bibliography 1374 
Index 1385 



Part One

Simple Linear Regression

Chapter 1

Linear Regression with One Predictor Variable

Regression analysis is a statistical methodology that utilizes the relation between two or
more quantitative variables so that a response or outcome variable can be predicted from
the other, or others. This methodology is widely used in business, the social and behavioral
sciences, the biological sciences, and many other disciplines. A few examples of applications
are:


1. Sales of a product can be predicted by utilizing the relationship between sales and amount
of advertising expenditures.

2. The performance of an employee on a job can be predicted by utilizing the relationship
between performance and a battery of aptitude tests.

3. The size of the vocabulary of a child can be predicted by utilizing the relationship
between size of vocabulary and age of the child and amount of education of the parents.

4. The length of hospital stay of a surgical patient can be predicted by utilizing the
relationship between the time in the hospital and the severity of the operation.


In Part I we take up regression analysis when a single predictor variable is used for
predicting the response or outcome variable of interest. In Parts II and III, we consider
regression analysis when two or more variables are used for making predictions. In this
chapter, we consider the basic ideas of regression analysis and discuss the estimation of the
parameters of regression models containing a single predictor variable.



1.1 Relations between Variables


The concept of a relation between two variables, such as between family income and family
expenditures for housing, is a familiar one. We distinguish between a functional relation
and a statistical relation, and consider each of these in turn.


Functional Relation between Two Variables


A functional relation between two variables is expressed by a mathematical formula. If X
denotes the independent variable and Y the dependent variable, a functional relation is
of the form:

Y = f(X)

Given a particular value of X, the function f indicates the corresponding value of Y.


Example 


Consider the relation between dollar sales (Y) of a product sold at a fixed price and number
of units sold (X). If the selling price is $2 per unit, the relation is expressed by the equation:

Y = 2X

This functional relation is shown in Figure 1.1. Number of units sold and dollar sales during
three recent periods (while the unit price remained constant at $2) were as follows:


Period    Number of Units Sold    Dollar Sales
1          75                     $150
2          25                       50
3         130                      260




These observations are plotted also in Figure 1.1. Note that all fall directly on the line of
functional relationship. This is characteristic of all functional relations.
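The defining property of a functional relation can be checked numerically. The following is a minimal sketch in Python, using only the three periods listed in the example; the names `UNIT_PRICE`, `dollar_sales`, `observations`, and `deviations` are illustrative choices, not notation from the text.

```python
# Functional relation from the example: dollar sales Y = 2X at a fixed $2 unit price.
UNIT_PRICE = 2  # dollars per unit, as stated in the example


def dollar_sales(units_sold):
    """Return dollar sales Y = f(X) for X units sold."""
    return UNIT_PRICE * units_sold


# (period, units sold X, observed dollar sales Y), taken from the table in the example.
observations = [(1, 75, 150), (2, 25, 50), (3, 130, 260)]

# Deviation of each observed Y from f(X); a functional relation leaves none.
deviations = [y - dollar_sales(x) for _, x, y in observations]

for (period, x, y), d in zip(observations, deviations):
    print(f"Period {period}: X = {x}, Y = {y}, f(X) = {dollar_sales(x)}, deviation = {d}")
```

Every deviation is zero, which is the numerical counterpart of all plotted points lying on the line Y = 2X.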


Statistical Relation between Two Variables 


A statistical relation, unlike a functional relation, is not a perfect one. In general, the
observations for a statistical relation do not fall directly on the curve of relationship.


Example 1 


Performance evaluations for 10 employees were obtained at midyear and at year-end.
These data are plotted in Figure 1.2a. Year-end evaluations are taken as the dependent or
response variable Y, and midyear evaluations as the independent, explanatory, or predictor
variable X. The plotting is done as before. For instance, the midyear and year-end
performance evaluations for the first employee are plotted at X = 90, Y = 94.

FIGURE 1.1 Example of Functional Relation. (Horizontal axis: Units Sold, X; vertical axis: Dollar Sales, Y.)

Figure 1.2a clearly suggests that there is a relation between midyear and year-end
evaluations, in the sense that the higher the midyear evaluation, the higher tends to be the
year-end evaluation. However, the relation is not a perfect one. There is a scattering of points,
suggesting that some of the variation in year-end evaluations is not accounted for by midyear
performance assessments. For instance, two employees had midyear evaluations of X = 80,
yet they received somewhat different year-end evaluations. Because of the scattering of
points in a statistical relation, Figure 1.2a is called a scatter diagram or scatter plot. In
statistical terminology, each point in the scatter diagram represents a trial or a case.

In Figure 1.2b, we have plotted a line of relationship that describes the statistical relation
between midyear and year-end evaluations. It indicates the general tendency by which
year-end evaluations vary with the level of midyear performance evaluation. Note that most of
the points do not fall directly on the line of statistical relationship. This scattering of points
around the line represents variation in year-end evaluations that is not associated with
midyear performance evaluation and that is usually considered to be of a random nature.
Statistical relations can be highly useful, even though they do not have the exactitude of a
functional relation.


Example 2


Figure 1.3 presents data on age and level of a steroid in plasma for 27 healthy females 
between 8 and 25 years old. The data strongly suggest that the statistical relationship is 
curvilinear (not linear). The curve of relationship has also been drawn in Figure 1.3. It 
implies that, as age increases, steroid level increases up to a point and then begins to level 
off. Note again the scattering of points around the curve of statistical relationship, typical 
of all statistical relations. 


FIGURE 1.2 Statistical Relation between Midyear Performance Evaluation and Year-End Evaluation.
(a) Scatter Plot; (b) Scatter Plot and Line of Statistical Relationship.
Axes: Midyear Evaluation (X), Year-End Evaluation (Y).
























1.2 Regression Models and Their Uses


Historical Origins

Regression analysis was first developed by Sir Francis Galton in the latter part of the
19th century. Galton had studied the relation between heights of parents and children and
noted that the heights of children of both tall and short parents appeared to “revert” or
“regress” to the mean of the group. He considered this tendency to be a regression to
“mediocrity.” Galton developed a mathematical description of this regression tendency, the
precursor of today’s regression models.

The term regression persists to this day to describe statistical relations between variables. 


Basic Concepts


A regression model is a formal means of expressing the two essential ingredients of a
statistical relation:

1. A tendency of the response variable Y to vary with the predictor variable X in a systematic
fashion.

2. A scattering of points around the curve of statistical relationship.

These two characteristics are embodied in a regression model by postulating that:

1. There is a probability distribution of Y for each level of X.

2. The means of these probability distributions vary in some systematic fashion with X.


Example

Consider again the performance evaluation example in Figure 1.2. The year-end evaluation Y
is treated in a regression model as a random variable. For each level of midyear performance
evaluation, there is postulated a probability distribution of Y. Figure 1.4 shows such a
probability distribution for X = 90, which is the midyear evaluation for the first employee.


FIGURE 1.3 Curvilinear Statistical Relation between Age and Steroid Level in Healthy Females Aged 8 to 25.

The actual year-end evaluation of this employee, Y = 94, is then viewed as a random 
selection from this probability distribution. 

Figure 1.4 also shows probability distributions of Y for midyear evaluation levels X = 50
and X = 70. Note that the means of the probability distributions have a systematic relation
to the level of X. This systematic relationship is called the regression function of Y on X.
The graph of the regression function is called the regression curve. Note that in Figure 1.4
the regression function is slightly curvilinear. This would imply for our example that the
increase in the expected (mean) year-end evaluation with an increase in midyear performance
evaluation is retarded at higher levels of midyear performance.

Regression models may differ in the form of the regression function (linear, curvilinear),
in the shape of the probability distributions of Y (symmetrical, skewed), and in other ways.
Whatever the variation, the concept of a probability distribution of Y for any given X is the
formal counterpart to the empirical scatter in a statistical relation. Similarly, the regression
curve, which describes the relation between the means of the probability distributions
of Y and the level of X, is the counterpart to the general tendency of Y to vary with X
systematically in a statistical relation.
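The two postulates can be made concrete with a small simulation. The sketch below is illustrative only: the curvilinear `regression_function`, the spread `ERROR_SD`, and the choice of normal distributions are assumptions made for this example and are not the values behind Figure 1.4.

```python
import random
import statistics

random.seed(0)

ERROR_SD = 3.0  # hypothetical spread of Y around its mean at every X


def regression_function(x):
    """Hypothetical, slightly curvilinear mean of Y at midyear evaluation x."""
    return 20.0 + 1.2 * x - 0.004 * x ** 2


def draw_year_end(x):
    """One random selection from the (assumed normal) distribution of Y at x."""
    return random.gauss(regression_function(x), ERROR_SD)


sample_means = {}
for x in (50, 70, 90):
    draws = [draw_year_end(x) for _ in range(10_000)]
    sample_means[x] = statistics.mean(draws)
    # Individual draws scatter, but their average tracks the regression function.
    print(x, round(regression_function(x), 2), round(sample_means[x], 2))
```

Each single draw plays the role of one employee's year-end evaluation; the averages at each X play the role of the regression curve.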


Regression Models with More than One Predictor Variable. Regression models may
contain more than one predictor variable. Three examples follow.

1. In an efficiency study of 67 branch offices of a consumer finance chain, the response
variable was direct operating cost for the year just ended. There were four predictor variables:
average size of loan outstanding during the year, average number of loans outstanding, total
number of new loan applications processed, and an index of office salaries.

2. In a tractor purchase study, the response variable was volume (in horsepower) of
tractor purchases in a sales territory of a farm equipment firm. There were nine predictor
variables, including average age of tractors on farms in the territory, number of farms in the
territory, and a quantity index of crop production in the territory.

3. In a medical study of short children, the response variable was the peak plasma growth
hormone level. There were 14 predictor variables, including age, gender, height, weight,
and 10 skinfold measurements.


The model features represented in Figure 1.4 must be extended into further dimensions
when there is more than one predictor variable. With two predictor variables $X_1$ and $X_2$,
for instance, a probability distribution of Y for each $(X_1, X_2)$ combination is assumed
by the regression model. The systematic relation between the means of these probability
distributions and the predictor variables $X_1$ and $X_2$ is then given by a regression surface.


Construction of Regression Models 


Selection of Predictor Variables. Since reality must be reduced to manageable proportions
whenever we construct models, only a limited number of explanatory or predictor
variables can, or should, be included in a regression model for any situation of interest.
A central problem in many exploratory studies is therefore that of choosing, for a
regression model, a set of predictor variables that is “good” in some sense for the purposes of
the analysis. A major consideration in making this choice is the extent to which a chosen
variable contributes to reducing the remaining variation in Y after allowance is made for
the contributions of other predictor variables that have tentatively been included in the
regression model. Other considerations include the importance of the variable as a causal
agent in the process under analysis; the degree to which observations on the variable can
be obtained more accurately, or quickly, or economically than on competing variables; and
the degree to which the variable can be controlled. In Chapter 9, we will discuss procedures
and problems in choosing the predictor variables to be included in the regression model.


Functional Form of Regression Relation. The choice of the functional form of the
regression relation is tied to the choice of the predictor variables. Sometimes, relevant theory
may indicate the appropriate functional form. Learning theory, for instance, may indicate
that the regression function relating unit production cost to the number of previous times the
item has been produced should have a specified shape with particular asymptotic properties.

More frequently, however, the functional form of the regression relation is not known in
advance and must be decided upon empirically once the data have been collected. Linear
or quadratic regression functions are often used as satisfactory first approximations to
regression functions of unknown nature. Indeed, these simple types of regression functions
may be used even when theory provides the relevant functional form, notably when the
known form is highly complex but can be reasonably approximated by a linear or quadratic
regression function. Figure 1.5a illustrates a case where the complex regression function
may be reasonably approximated by a linear regression function. Figure 1.5b provides an
example where two linear regression functions may be used “piecewise” to approximate a
complex regression function.

FIGURE 1.5 Uses of Linear Regression Functions to Approximate Complex Regression
Functions. Bold Line Is the True Regression Function and Dotted Line Is the Regression
Approximation. (a) Linear Approximation; (b) Piecewise Linear Approximation.

Scope of Model. In formulating a regression model, we usually need to restrict the
coverage of the model to some interval or region of values of the predictor variable(s). The
scope is determined either by the design of the investigation or by the range of data at hand.
For instance, a company studying the effect of price on sales volume investigated six price
levels, ranging from $4.95 to $6.95. Here, the scope of the model is limited to price levels
ranging from near $5 to near $7. The shape of the regression function substantially outside
this range would be in serious doubt because the investigation provided no evidence as to
the nature of the statistical relation below $4.95 or above $6.95.


Uses of Regression Analysis

Regression analysis serves three major purposes: (1) description, (2) control, and
(3) prediction. These purposes are illustrated by the three examples cited earlier. The tractor
purchase study served a descriptive purpose. In the study of branch office operating costs, the main
purpose was administrative control; by developing a usable statistical relation between cost
and the predictor variables, management was able to set cost standards for each branch office
in the company chain. In the medical study of short children, the purpose was prediction.
Clinicians were able to use the statistical relation to predict growth hormone deficiencies
in short children by using simple measurements of the children.

The several purposes of regression analysis frequently overlap in practice. The branch 
office example is a case in point. Knowledge of the relation between operating cost and 
characteristics of the branch office not only enabled management to set cost standards for 
each office but management could also predict costs, and at the end of the fiscal year it 
could compare the actual branch cost against the expected cost. 


Regression and Causality 


The existence of a statistical relation between the response variable Y and the explanatory or
predictor variable X does not imply in any way that Y depends causally on X. No matter how
strong is the statistical relation between X and Y, no cause-and-effect pattern is necessarily
implied by the regression model. For example, data on size of vocabulary (X) and writing
speed (Y) for a sample of young children aged 5-10 will show a positive regression relation.
This relation does not imply, however, that an increase in vocabulary causes a faster writing
speed. Here, other explanatory variables, such as age of the child and amount of education,
affect both the vocabulary (X) and the writing speed (Y). Older children have a larger
vocabulary and a faster writing speed.


Even when a strong statistical relationship reflects causal conditions, the causal
conditions may act in the opposite direction, from Y to X. Consider, for instance, the calibration
of a thermometer. Here, readings of the thermometer are taken at different known
temperatures, and the regression relation is studied so that the accuracy of predictions made by using
the thermometer readings can be assessed. For this purpose, the thermometer reading is the
predictor variable X, and the actual temperature is the response variable Y to be predicted.
However, the causal pattern here does not go from X to Y, but in the opposite direction: the
actual temperature (Y) affects the thermometer reading (X).
















These examples demonstrate the need for care in drawing conclusions about causal
relations from regression analysis. Regression analysis by itself provides no information
about causal patterns and must be supplemented by additional analyses to obtain insights
about causal relations.


Use of Computers 

Because regression analysis often entails lengthy and tedious calculations, computers are
usually utilized to perform the necessary calculations. Almost every statistics package for
computers contains a regression component. While packages differ in many details, their
basic regression output tends to be quite similar.

After an initial explanation of required regression calculations, we shall rely on computer
calculations for all subsequent examples. We illustrate computer output by presenting output
and graphics from BMDP (Ref. 1.1), MINITAB (Ref. 1.2), SAS (Ref. 1.3), SPSS (Ref. 1.4),
SYSTAT (Ref. 1.5), JMP (Ref. 1.6), S-Plus (Ref. 1.7), and MATLAB (Ref. 1.8).


1.3 Simple Linear Regression Model with Distribution
of Error Terms Unspecified


Formal Statement of Model 

In Part I we consider a basic regression model where there is only one predictor variable
and the regression function is linear. The model can be stated as follows:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$    (1.1)

where:

$Y_i$ is the value of the response variable in the ith trial
$\beta_0$ and $\beta_1$ are parameters
$X_i$ is a known constant, namely, the value of the predictor variable in the ith trial
$\varepsilon_i$ is a random error term with mean $E\{\varepsilon_i\} = 0$ and variance $\sigma^2\{\varepsilon_i\} = \sigma^2$; $\varepsilon_i$ and $\varepsilon_j$ are
uncorrelated so that their covariance is zero (i.e., $\sigma\{\varepsilon_i, \varepsilon_j\} = 0$ for all $i, j$; $i \neq j$)
$i = 1, \ldots, n$

Regression model (1.1) is said to be simple, linear in the parameters, and linear in the
predictor variable. It is “simple” in that there is only one predictor variable, “linear in the
parameters,” because no parameter appears as an exponent or is multiplied or divided by
another parameter, and “linear in the predictor variable,” because this variable appears only
in the first power. A model that is linear in the parameters and in the predictor variable is
also called a first-order model.


Important Features of Model 

1. The response $Y_i$ in the ith trial is the sum of two components: (1) the constant term
$\beta_0 + \beta_1 X_i$ and (2) the random term $\varepsilon_i$. Hence, $Y_i$ is a random variable.

2. Since $E\{\varepsilon_i\} = 0$, it follows from (A.13c) in Appendix A that:

$E\{Y_i\} = E\{\beta_0 + \beta_1 X_i + \varepsilon_i\} = \beta_0 + \beta_1 X_i + E\{\varepsilon_i\} = \beta_0 + \beta_1 X_i$

Note that $\beta_0 + \beta_1 X_i$ plays the role of the constant a in (A.13c).
























Thus, the response $Y_i$, when the level of X in the ith trial is $X_i$, comes from a probability
distribution whose mean is:

$E\{Y_i\} = \beta_0 + \beta_1 X_i$    (1.2)

We therefore know that the regression function for model (1.1) is:

$E\{Y\} = \beta_0 + \beta_1 X$    (1.3)

since the regression function relates the means of the probability distributions of Y for given
X to the level of X.

3. The response $Y_i$ in the ith trial exceeds or falls short of the value of the regression
function by the error term amount $\varepsilon_i$.

4. The error terms $\varepsilon_i$ are assumed to have constant variance $\sigma^2$. It therefore follows that
the responses $Y_i$ have the same constant variance:

$\sigma^2\{Y_i\} = \sigma^2$    (1.4)

since, using (A.16a), we have:

$\sigma^2\{\beta_0 + \beta_1 X_i + \varepsilon_i\} = \sigma^2\{\varepsilon_i\} = \sigma^2$

Thus, regression model (1.1) assumes that the probability distributions of Y have the same
variance $\sigma^2$, regardless of the level of the predictor variable X.

5. The error terms are assumed to be uncorrelated. Since the error terms $\varepsilon_i$ and $\varepsilon_j$ are
uncorrelated, so are the responses $Y_i$ and $Y_j$.

6. In summary, regression model (1.1) implies that the responses $Y_i$ come from probability
distributions whose means are $E\{Y_i\} = \beta_0 + \beta_1 X_i$ and whose variances are $\sigma^2$, the
same for all levels of X. Further, any two responses $Y_i$ and $Y_j$ are uncorrelated.

Example

A consultant for an electrical distributor is studying the relationship between the number
of bids requested by construction contractors for basic lighting equipment during a week
and the time required to prepare the bids. Suppose that regression model (1.1) is applicable
and is as follows:

$Y_i = 9.5 + 2.1X_i + \varepsilon_i$

where X is the number of bids prepared in a week and Y is the number of hours required to
prepare the bids. Figure 1.6 contains a presentation of the regression function:

$E\{Y\} = 9.5 + 2.1X$

Suppose that in the ith week, $X_i = 45$ bids are prepared and the actual number of hours
required is $Y_i = 108$. In that case, the error term value is $\varepsilon_i = 4$, for we have

$E\{Y_i\} = 9.5 + 2.1(45) = 104$

and

$Y_i = 108 = 104 + 4$

Figure 1.6 displays the probability distribution of Y when X = 45 and indicates from
where in this distribution the observation $Y_i = 108$ came. Note again that the error term $\varepsilon_i$
is simply the deviation of $Y_i$ from its mean value $E\{Y_i\}$.
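The arithmetic of this example is easy to reproduce. The sketch below uses the parameter values 9.5 and 2.1 from the example; the function and variable names are illustrative only.

```python
BETA_0 = 9.5  # intercept of the regression function in the bids example
BETA_1 = 2.1  # slope: mean hours added per additional bid


def mean_hours(bids):
    """Regression function E{Y} = 9.5 + 2.1 X."""
    return BETA_0 + BETA_1 * bids


x_i = 45   # bids prepared in the ith week
y_i = 108  # hours actually required in that week

expected = mean_hours(x_i)
error_term = y_i - expected  # deviation of Y_i from its mean E{Y_i}
print(expected, error_term)
```

In practice the error term is not observable, because the parameters are unknown; it can be computed here only because the example supposes the model is known.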

Figure 1.6 also shows the probability distribution of Y when X = 25. Note that this
distribution exhibits the same variability as the probability distribution when X = 45, in
conformance with the requirements of regression model (1.1).

FIGURE 1.6 Illustration of Simple Linear Regression Model (1.1).

FIGURE 1.7 Meaning of Parameters of Simple Linear Regression Model (1.1).

Meaning of Regression Parameters

The parameters $\beta_0$ and $\beta_1$ in regression model (1.1) are called regression coefficients.
$\beta_1$ is the slope of the regression line. It indicates the change in the mean of the probability
distribution of Y per unit increase in X. The parameter $\beta_0$ is the Y intercept of the regression
line. When the scope of the model includes X = 0, $\beta_0$ gives the mean of the probability
distribution of Y at X = 0. When the scope of the model does not cover X = 0, $\beta_0$ does
not have any particular meaning as a separate term in the regression model.


Example 


Figure 1.7 shows the regression function:

$E\{Y\} = 9.5 + 2.1X$

for the electrical distributor example. The slope $\beta_1 = 2.1$ indicates that the preparation of
one additional bid in a week leads to an increase in the mean of the probability distribution
of Y of 2.1 hours.

The intercept $\beta_0 = 9.5$ indicates the value of the regression function at X = 0. However,
since the linear regression model was formulated to apply to weeks where the number of
bids prepared ranges from 20 to 80, $\beta_0$ does not have any intrinsic meaning of its own
here. If the scope of the model were to be extended to X levels near zero, a model with
a curvilinear regression function and some value of $\beta_0$ different from that for the linear
regression function might well be required.

Alternative Versions of Regression Model 

Sometimes it is convenient to write the simple linear regression model (1.1) in somewhat
different, though equivalent, forms. Let $X_0$ be a constant identically equal to 1. Then, we
can write (1.1) as follows:

$Y_i = \beta_0 X_0 + \beta_1 X_i + \varepsilon_i$  where $X_0 \equiv 1$    (1.5)

This version of the model associates an X variable with each regression coefficient.

An alternative modification is to use for the predictor variable the deviation $X_i - \bar{X}$
rather than $X_i$. To leave model (1.1) unchanged, we need to write:

$Y_i = \beta_0 + \beta_1(X_i - \bar{X}) + \beta_1\bar{X} + \varepsilon_i$
$\quad = (\beta_0 + \beta_1\bar{X}) + \beta_1(X_i - \bar{X}) + \varepsilon_i$
$\quad = \beta_0^* + \beta_1(X_i - \bar{X}) + \varepsilon_i$

Thus, this alternative model version is:

$Y_i = \beta_0^* + \beta_1(X_i - \bar{X}) + \varepsilon_i$    (1.6)

where:

$\beta_0^* = \beta_0 + \beta_1\bar{X}$    (1.6a)

We use models (1.1), (1.5), and (1.6) interchangeably as convenience dictates.
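That the centered version leaves the model unchanged can be verified numerically. The sketch below reuses the coefficients 9.5 and 2.1 from the electrical distributor example; the list `x_values` is a hypothetical set of X levels within the stated scope of 20 to 80 bids, chosen only for illustration.

```python
import math

BETA_0, BETA_1 = 9.5, 2.1          # coefficients from the bids example
x_values = [20, 35, 45, 60, 80]    # hypothetical X levels (assumption)

x_bar = sum(x_values) / len(x_values)
beta_0_star = BETA_0 + BETA_1 * x_bar  # intercept of the centered model, as in (1.6a)


def mean_original(x):
    """Mean response under model (1.1)."""
    return BETA_0 + BETA_1 * x


def mean_centered(x):
    """Mean response under the centered model (1.6)."""
    return beta_0_star + BETA_1 * (x - x_bar)


# Both parameterizations give the same mean response at every X level.
same_means = all(
    math.isclose(mean_original(x), mean_centered(x)) for x in x_values
)
print(x_bar, beta_0_star, same_means)
```

Note that the centered intercept depends on the particular X values in the sample through their mean, while the slope is unaffected.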


1.4 Data for Regression Analysis 

Ordinarily, we do not know the values of the regression parameters $\beta_0$ and $\beta_1$ in regression
model (1.1), and we need to estimate them from relevant data. Indeed, as we noted earlier, we
frequently do not have adequate a priori knowledge of the appropriate predictor variables 
and of the functional form of the regression relation (e.g., linear or curvilinear), and we 
need to rely on an analysis of the data for developing a suitable regression model. 

Data for regression analysis may be obtained from nonexperimental or experimental 
studies. We consider each of these in turn. 

Observational Data 

Observational data are data obtained from nonexperimental studies. Such studies do not
control the explanatory or predictor variable(s) of interest. For example, company officials
wished to study the relation between age of employee (X) and number of days of illness
last year (Y). The needed data for use in the regression analysis were obtained from
personnel records. Such data are observational data since the explanatory variable, age, is not
controlled.

Regression analyses are frequently based on observational data, since often it is not
feasible to conduct controlled experimentation. In the company personnel example just
mentioned, for instance, it would not be possible to control age by assigning ages to persons.





A major limitation of observational data is that they often do not provide adequate
information about cause-and-effect relationships. For example, a positive relation between age of
employee and number of days of illness in the company personnel example may not imply
that number of days of illness is the direct result of age. It might be that younger employees
of the company primarily work indoors while older employees usually work outdoors, and
that work location is more directly responsible for the number of days of illness than age.

Whenever a regression analysis is undertaken for purposes of description based on
observational data, one should investigate whether explanatory variables other than those
considered in the regression model might more directly explain cause-and-effect relationships.

Experimental Data 

Frequently, it is possible to conduct a controlled experiment to provide data from which the 
regression parameters can be estimated. Consider, for instance, an insurance company that 
wishes to study the relation between productivity of its analysts in processing claims and 
length of training. Nine analysts are to be used in the study. Three of them will be selected 
at random and trained for two weeks, three for three weeks, and three for five weeks. 
The productivity of the analysts during the next 10 weeks will then be observed. The data 
so obtained will be experimental data because control is exercised over the explanatory 
variable, length of training. 

When control over the explanatory variable(s) is exercised through random assignments,
as in the productivity study example, the resulting experimental data provide much stronger 
information about cause-and-effect relationships than do observational data. The reason is 
that randomization tends to balance out the effects of any other variables that might affect 
the response variable, such as the effect of aptitude of the employee on productivity. 

In the terminology of experimental design, the length of training assigned to an analyst in 
the productivity study example is called a treatment. The analysts to be included in the study 
are called the experimental units. Control over the explanatory variable(s) then consists of 
assigning a treatment to each of the experimental units by means of randomization. 

Completely Randomized Design 

The most basic type of statistical design for making randomized assignments of treatments to 
experimental units (or vice versa) is the completely randomized design. With this design, the 
assignments are made completely at random. This complete randomization provides that all 
combinations of experimental units assigned to the different treatments are equally likely, 
which implies that every experimental unit has an equal chance to receive any one of the 
treatments. 

A completely randomized design is particularly useful when the experimental units are
quite homogeneous. This design is very flexible; it accommodates any number of treatments
and permits different sample sizes for different treatments. Its chief disadvantage is that,
when the experimental units are heterogeneous, this design is not as efficient as some other
statistical designs.
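One common way to carry out a completely randomized assignment is to shuffle the experimental units and deal them out to the treatments. The sketch below applies this to the productivity study (nine analysts; three each trained for two, three, and five weeks). The analyst labels, the function name, and the seed are hypothetical choices for illustration.

```python
import random


def completely_randomized_assignment(units, treatment_sizes, rng):
    """Shuffle the units, then deal them to treatments in the requested sizes.

    A uniformly random shuffle makes every split of the units into groups of
    the given sizes equally likely.
    """
    if sum(treatment_sizes.values()) != len(units):
        raise ValueError("treatment sizes must add up to the number of units")
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {}
    start = 0
    for treatment, size in treatment_sizes.items():
        assignment[treatment] = shuffled[start:start + size]
        start += size
    return assignment


analysts = [f"analyst_{k}" for k in range(1, 10)]  # hypothetical labels
weeks_of_training = {2: 3, 3: 3, 5: 3}             # treatment -> group size

assignment = completely_randomized_assignment(
    analysts, weeks_of_training, random.Random(42)  # seed fixed for repeatability
)
for weeks, group in assignment.items():
    print(weeks, group)
```

Unequal group sizes are handled by changing the values in `weeks_of_training`, which reflects the flexibility of the design noted above.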

1.5 Overview of Steps in Regression Analysis

The regression models considered in this and subsequent chapters can be utilized either
for observational data or for experimental data from a completely randomized design.
(Regression analysis can also utilize data from other types of experimental designs, but
the regression models presented here will need to be modified.) Whether the data are
observational or experimental, it is essential that the conditions of the regression model be
appropriate for the data at hand for the model to be applicable.

We begin our discussion of regression analysis by considering inferences about the
regression parameters for the simple linear regression model (1.1). For the rare occasion
where prior knowledge or theory alone enables us to determine the appropriate regression
model, inferences based on the regression model are the first step in the regression analysis.
In the usual situation, however, where we do not have adequate knowledge to specify the
appropriate regression model in advance, the first step is an exploratory study of the data,
as shown in the flowchart in Figure 1.8. On the basis of this initial exploratory analysis,
one or more preliminary regression models are developed. These regression models are
then examined for their appropriateness for the data at hand and revised, or new models
are developed, until the investigator is satisfied with the suitability of a particular
regression model. Only then are inferences made on the basis of this regression model, such as
inferences about the regression parameters of the model or predictions of new observations.

FIGURE 1.8 Typical Strategy for Regression Analysis. Final flowchart steps: make inferences
on basis of regression model; stop.

We begin, for pedagogic reasons, with inferences based on the regression model that is 
finally considered to be appropriate. One must have an understanding of regression models 
and how they can be utilized before the issues involved in the development of an appropriate 
regression model can be fully explained. 

1.6 Estimation of Regression Function

The observational or experimental data to be used for estimating the parameters of the
regression function consist of observations on the explanatory or predictor variable X and
the corresponding observations on the response variable Y. For each trial, there is an X
observation and a Y observation. We denote the (X, Y) observations for the first trial as
$(X_1, Y_1)$, for the second trial as $(X_2, Y_2)$, and in general for the ith trial as $(X_i, Y_i)$, where
$i = 1, \ldots, n$.

Example

In a small-scale study of persistence, an experimenter gave three subjects a very difficult
task. Data on the age of the subject (X) and on the number of attempts to accomplish the
task before giving up (Y) follow:

Subject i:                    1     2     3
Age X_i:                     20    55    30
Number of attempts Y_i:       5    12    10

In terms of the notation to be employed, there were n = 3 subjects in this study, the
observations for the first subject were $(X_1, Y_1) = (20, 5)$, and similarly for the other
subjects.

Method of Least Squares 

To find “good” estimators of the regression parameters $\beta_0$ and $\beta_1$, we employ the method
of least squares. For the observations $(X_i, Y_i)$ for each case, the method of least squares
considers the deviation of $Y_i$ from its expected value:

$Y_i - (\beta_0 + \beta_1 X_i)$    (1.7)

In particular, the method of least squares requires that we consider the sum of the n squared
deviations. This criterion is denoted by Q:

$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$    (1.8)

According to the method of least squares, the estimators of $\beta_0$ and $\beta_1$ are those values
$b_0$ and $b_1$, respectively, that minimize the criterion Q for the given sample observations
$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.


Example

Figure 1.9a presents the scatter plot of the data for the persistence study example and the
regression line that results when we use the mean of the responses (9.0) as the predictor
and ignore X:

$\hat{Y} = 9.0 + 0(X)$

Note that this regression line uses estimates $b_0 = 9.0$ and $b_1 = 0$, and that $\hat{Y}$ denotes
the ordinate of the estimated regression line. Clearly, this regression line is not a good
fit, as evidenced by the large vertical deviations of two of the Y observations from the
corresponding ordinates $\hat{Y}$ of the regression line. The deviation for the first subject, for
which $(X_1, Y_1) = (20, 5)$, is:

$Y_1 - (b_0 + b_1 X_1) = 5 - [9.0 + 0(20)] = 5 - 9.0 = -4$

The sum of the squared deviations for the three cases is:

$Q = (5 - 9.0)^2 + (12 - 9.0)^2 + (10 - 9.0)^2 = 26.0$

Figure 1.9b shows the same data with the regression line:

$\hat{Y} = 2.81 + .177X$

The fit of this regression line is clearly much better. The vertical deviation for the first case
now is:

$Y_1 - (b_0 + b_1 X_1) = 5 - [2.81 + .177(20)] = 5 - 6.35 = -1.35$

and the criterion Q is much reduced:

$Q = (5 - 6.35)^2 + (12 - 12.55)^2 + (10 - 8.12)^2 = 5.7$

Thus, a better fit of the regression line to the data corresponds to a smaller sum Q.
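The two values of Q in this example can be recomputed from the persistence data. The following is a small sketch; the function name `least_squares_criterion` is illustrative. With unrounded fitted values the second sum comes to about 5.65, which agrees with the reported 5.7 after rounding.

```python
ages = [20, 55, 30]      # X: age of each subject
attempts = [5, 12, 10]   # Y: number of attempts before giving up


def least_squares_criterion(b0, b1, xs, ys):
    """Sum of squared vertical deviations of Y from the line b0 + b1 * X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


q_mean_line = least_squares_criterion(9.0, 0.0, ages, attempts)
q_fitted_line = least_squares_criterion(2.81, 0.177, ages, attempts)
print(q_mean_line, round(q_fitted_line, 2))
```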

The objective of the method of least squares is to find estimates $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$,
respectively, for which Q is a minimum. In a certain sense, to be discussed shortly, these
estimates will provide a “good” fit of the linear regression function. The regression line in
Figure 1.9b is, in fact, the least squares regression line.

FIGURE 1.9 Illustration of Least Squares Criterion Q for Fit of a Regression Line: Persistence Study
Example. (a) $\hat{Y} = 9.0 + 0(X)$, Q = 26.0; (b) $\hat{Y} = 2.81 + .177X$, Q = 5.7. Vertical axis: Attempts.

Least Squares Estimators. The estimators $b_0$ and $b_1$ that satisfy the least squares criterion
can be found in two basic ways:

1. Numerical search procedures can be used that evaluate in a systematic fashion the least
squares criterion $Q$ for different estimates $b_0$ and $b_1$ until the ones that minimize $Q$ are
found. This approach was illustrated in Figure 1.9 for the persistence study example.

2. Analytical procedures can often be used to find the values of $b_0$ and $b_1$ that minimize
$Q$. The analytical approach is feasible when the regression model is not mathematically
complex.
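The first approach can be sketched as a brute-force grid search. This is one simple kind of numerical search, chosen here for illustration; the text does not prescribe a particular search procedure. The search ranges, step sizes, and the name `grid_search` are assumptions made for this sketch.

```python
# Persistence study data quoted in the text: (X, Y) for the three subjects.
DATA = [(20, 5), (55, 12), (30, 10)]


def least_squares_criterion(b0, b1, data):
    """Return Q, the sum of squared deviations Y - (b0 + b1*X)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)


def grid_search(data, b0_max=5.0, b1_max=0.5, b0_step=0.01, b1_step=0.001):
    """Evaluate Q on a grid of (b0, b1) values and return the best (Q, b0, b1).

    The ranges [0, b0_max] and [0, b1_max] are assumed for this example only.
    """
    n0 = int(round(b0_max / b0_step))
    n1 = int(round(b1_max / b1_step))
    best = None
    for i in range(n0 + 1):
        b0 = i * b0_step
        for j in range(n1 + 1):
            b1 = j * b1_step
            q = least_squares_criterion(b0, b1, data)
            if best is None or q < best[0]:
                best = (q, b0, b1)
    return best


best_q, best_b0, best_b1 = grid_search(DATA)
print(f"grid minimum: b0 = {best_b0:.2f}, b1 = {best_b1:.3f}, Q = {best_q:.3f}")
```

A grid search only locates the minimum to within the grid resolution, which is one reason the analytical approach is preferred whenever it is available.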

Using the analytical approach, it can be shown for regression model (1.1) that the values
$b_0$ and $b_1$ that minimize $Q$ for any particular set of sample data are given by the following
simultaneous equations:

$$\sum Y_i = n b_0 + b_1 \sum X_i \qquad (1.9a)$$

$$\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2 \qquad (1.9b)$$

Equations (1.9a) and (1.9b) are called normal equations; $b_0$ and $b_1$ are called point estimators of $\beta_0$ and $\beta_1$, respectively.

The normal equations (1.9) can be solved simultaneously for $b_0$ and $b_1$:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \qquad (1.10a)$$

$$b_0 = \frac{1}{n}\left(\sum Y_i - b_1 \sum X_i\right) = \bar{Y} - b_1 \bar{X} \qquad (1.10b)$$

where $\bar{X}$ and $\bar{Y}$ are the means of the $X_i$ and the $Y_i$ observations, respectively. Computer
calculations generally are based on many digits to obtain accurate values for $b_0$ and $b_1$.
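As an illustration of the analytical approach, the following sketch evaluates (1.10a) and (1.10b) for the three persistence study cases and then checks that the results satisfy both normal equations (1.9). The function name `least_squares_estimates` is a hypothetical helper introduced for this example.

```python
def least_squares_estimates(xs, ys):
    """Return (b0, b1) from formulas (1.10b) and (1.10a)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # (1.10a)
    b0 = y_bar - b1 * x_bar   # (1.10b)
    return b0, b1


# Persistence study data quoted in the text.
xs = [20, 55, 30]
ys = [5, 12, 10]
b0, b1 = least_squares_estimates(xs, ys)

# Gaps in the two normal equations; both should be zero up to rounding error.
n = len(xs)
gap_a = sum(ys) - (n * b0 + b1 * sum(xs))                                  # (1.9a)
gap_b = sum(x * y for x, y in zip(xs, ys)) - (
    b0 * sum(xs) + b1 * sum(x * x for x in xs)
)                                                                          # (1.9b)

print(f"b0 = {b0:.2f}, b1 = {b1:.3f}")
```

The rounded estimates match the line $\hat{Y} = 2.81 + .177X$ used for Figure 1.9b.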


Comment

The normal equations (1.9) can be derived by calculus. For given sample observations $(X_i, Y_i)$, the quantity $Q$ in (1.8) is a function of $\beta_0$ and $\beta_1$. The values of $\beta_0$ and $\beta_1$ that minimize $Q$ can be derived by differentiating (1.8) with respect to $\beta_0$ and $\beta_1$. We obtain:

$$\frac{\partial Q}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i)$$

$$\frac{\partial Q}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i)$$

We then set these partial derivatives equal to zero, using $b_0$ and $b_1$ to denote the particular values of $\beta_0$ and $\beta_1$ that minimize $Q$:

$$-2 \sum (Y_i - b_0 - b_1 X_i) = 0$$

$$-2 \sum X_i (Y_i - b_0 - b_1 X_i) = 0$$

Simplifying, we obtain:

$$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0$$

$$\sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0$$

Expanding, we have:

$$\sum Y_i - n b_0 - b_1 \sum X_i = 0$$

$$\sum X_i Y_i - b_0 \sum X_i - b_1 \sum X_i^2 = 0$$

from which the normal equations (1.9) are obtained by rearranging terms.

A test of the second partial derivatives will show that a minimum is obtained with the least squares
estimators $b_0$ and $b_1$. ■
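The second-derivative test mentioned above can be sketched as follows. This is a standard calculus argument supplied here for illustration, not a derivation taken from the text. Differentiating the first partial derivatives of $Q$ once more gives:

$$\frac{\partial^2 Q}{\partial \beta_0^2} = 2n \qquad \frac{\partial^2 Q}{\partial \beta_1^2} = 2 \sum X_i^2 \qquad \frac{\partial^2 Q}{\partial \beta_0 \, \partial \beta_1} = 2 \sum X_i$$

$$\frac{\partial^2 Q}{\partial \beta_0^2} \cdot \frac{\partial^2 Q}{\partial \beta_1^2} - \left(\frac{\partial^2 Q}{\partial \beta_0 \, \partial \beta_1}\right)^2 = 4\left[n \sum X_i^2 - \left(\sum X_i\right)^2\right] = 4n \sum (X_i - \bar{X})^2$$

Since $2n > 0$ and the determinant is positive whenever the $X_i$ are not all equal, the stationary point given by the normal equations is a minimum of $Q$.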


Properties of Least Squares Estimators. An important theorem, called the Gauss-Markov theorem, states:

Under the conditions of regression model (1.1), the least squares estimators $b_0$ and $b_1$ in (1.10) are unbiased and have minimum variance among all unbiased linear estimators. (1.11)

This theorem, proven in the next chapter, states first that $b_0$ and $b_1$ are unbiased estimators.
Hence:

$$E\{b_0\} = \beta_0 \qquad E\{b_1\} = \beta_1$$

so that neither estimator tends to overestimate or underestimate systematically.

Second, the theorem states that the estimators $b_0$ and $b_1$ are more precise (i.e., their
sampling distributions are less variable) than any other estimators belonging to the class of
unbiased estimators that are linear functions of the observations $Y_1, \ldots, Y_n$. The estimators
$b_0$ and $b_1$ are such linear functions of the $Y_i$. Consider, for instance, $b_1$. We have from (1.10a):


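$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$

It will be shown in Chapter 2 that this expression is equal to:

$$b_1 = \frac{\sum (X_i - \bar{X}) Y_i}{\sum (X_i - \bar{X})^2} = \sum k_i Y_i$$

where:

$$k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$$

Since the $k_i$ are known constants (because the $X_i$ are known constants), $b_1$ is a linear
combination of the $Y_i$ and hence is a linear estimator.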
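The linear-combination form can be checked numerically. The sketch below, which uses the three persistence study cases quoted earlier, computes the slope both as the ratio in (1.10a) and as a weighted sum of the observations; the variable names are illustrative, and the weights are the constants defined just above.

```python
# Persistence study data quoted earlier in the text.
xs = [20, 55, 30]
ys = [5, 12, 10]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)

# Weights that depend only on the X levels, not on the Y observations.
weights = [(x - x_bar) / sxx for x in xs]

# Slope as a linear combination of the Y observations.
b1_linear = sum(w * y for w, y in zip(weights, ys))

# Slope from the ratio form (1.10a), for comparison.
b1_ratio = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx

print(f"weights sum to {sum(weights):.1e}")
print(f"b1 (linear form) = {b1_linear:.6f}, b1 (ratio form) = {b1_ratio:.6f}")
```

The two forms agree because the weights sum to zero, so subtracting $\bar{Y}$ from every observation leaves the weighted sum unchanged.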



In the same fashion, it can be shown that $b_0$ is a linear estimator. Among all linear
estimators that are unbiased then, $b_0$ and $b_1$ have the smallest variability in repeated samples
in which the $X$ levels remain unchanged.

Example

The Toluca Company manufactures refrigeration equipment as well as many replacement 
parts. In the past, one of the replacement parts has been produced periodically in lots of 
varying sizes. When a cost improvement program was undertaken, company officials wished 
to determine the optimum lot size for producing this part. The production of this part involves 
setting up the production process (which must be done no matter what is the lot size) and 
machining and assembly operations. One key input for the model to ascertain the optimum 
lot size was the relationship between lot size and labor hours required to produce the lot.
To determine this relationship, data on lot size and work hours for 25 recent production
runs were utilized. The production conditions were stable during the six-month period in
which the 25 runs were made and were expected to continue to be the same during the
next three years, the planning period for which the cost improvement program was being
conducted. 

Table 1.1 contains a portion of the data on lot size and work hours in columns 1 and 
2. Note that all lot sizes are multiples of 10, a result of company policy to facilitate the 
administration of the parts production. Figure 1.10a shows a SYSTAT scatter plot of the 
data. We see that the lot sizes ranged from 20 to 120 units and that none of the production 
runs was outlying in the sense of being either unusually small or large. The scatter plot also 
indicates that the relationship between lot size and work hours is reasonably linear. We also 
see that no observations on work hours are unusually small or large, with reference to the 
relationship between lot size and work hours. 

To calculate the least squares estimates $b_0$ and $b_1$ in (1.10), we require the deviations
$X_i - \bar{X}$ and $Y_i - \bar{Y}$. These are given in columns 3 and 4 of Table 1.1. We also require
the cross-product terms $(X_i - \bar{X})(Y_i - \bar{Y})$ and the squared deviations $(X_i - \bar{X})^2$; these
are shown in columns 5 and 6. The squared deviations $(Y_i - \bar{Y})^2$ in column 7 are for
later use.


TABLE 1.1 Data on Lot Size and Work Hours and Needed Calculations for Least Squares Estimates: Toluca Company Example.

           (1)         (2)          (3)               (4)               (5)                                  (6)                   (7)
  Run      Lot Size    Work Hours
  i        $X_i$       $Y_i$        $X_i - \bar{X}$   $Y_i - \bar{Y}$   $(X_i - \bar{X})(Y_i - \bar{Y})$     $(X_i - \bar{X})^2$   $(Y_i - \bar{Y})^2$
  1        80          399           10                 86.72             867.2                                100                   7,520.4
  2        30          121          -40               -191.28           7,651.2                              1,600                  36,588.0
  3        50          221          -20                -91.28           1,825.6                                400                   8,332.0
  ...      ...         ...          ...                ...              ...                                    ...                   ...
  23       40          244          -30                -68.28           2,048.4                                900                   4,662.2
  24       80          342           10                 29.72             297.2                                100                     883.3
  25       70          323            0                 10.72               0.0                                  0                     114.9
  Total    1,750       7,807          0                  0              70,690                              19,800                 307,203
  Mean     70.0        312.28




FIGURE 1.10 SYSTAT Scatter Plot and Fitted Regression Line: Toluca Company Example.
(a) Scatter Plot
(b) Fitted Regression Line
[Horizontal axis: Lot Size. Vertical axis: Hours.]

FIGURE 1.11 Portion of MINITAB Regression Output: Toluca Company Example.

  The regression equation is
  Y = 62.4 + 3.57 X

  Predictor      Coef      Stdev    t-ratio        p
  Constant      62.37      26.18       2.38    0.026
  X            3.5702     0.3470      10.29    0.000

  s = 48.82     R-sq = 82.2%     R-sq(adj) = 81.4%

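We see from Table 1.1 that the basic quantities needed to calculate the least squares
estimates are as follows:

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 70,690$$

$$\sum (X_i - \bar{X})^2 = 19,800$$

$$\bar{X} = 70.0$$

$$\bar{Y} = 312.28$$

Using (1.10) we obtain:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{70,690}{19,800} = 3.5702$$

$$b_0 = \bar{Y} - b_1 \bar{X} = 312.28 - 3.5702(70.0) = 62.37$$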

Thus, we estimate that the mean number of work hours increases by 3.57 hours for each 
additional unit produced in the lot. This estimate applies to the range of lot sizes in the 
data from which the estimates were derived, namely to lot sizes ranging from about 20 to 
about 120. 
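The arithmetic for the Toluca Company estimates can be reproduced directly from the column totals and means in Table 1.1. A minimal sketch follows; the variable names are illustrative, and only the summary quantities quoted in the text are used because the full data set of 25 runs is not listed here.

```python
# Summary quantities from Table 1.1 (Toluca Company example).
sum_cross_products = 70_690.0   # sum of (X_i - X_bar)(Y_i - Y_bar)
sum_squared_x_dev = 19_800.0    # sum of (X_i - X_bar)^2
x_bar = 70.0                    # mean lot size
y_bar = 312.28                  # mean work hours

b1 = sum_cross_products / sum_squared_x_dev   # slope, formula (1.10a)
b0 = y_bar - b1 * x_bar                       # intercept, formula (1.10b)

print(f"b1 = {b1:.4f} hours per additional unit in the lot")
print(f"b0 = {b0:.2f} hours")
```

Carrying the slope at full precision, rather than rounding it to 3.5702 before computing the intercept, follows the text's remark that computer calculations are generally based on many digits.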

Figure 1.11 contains a portion of the MINITAB regression output for the Toluca Company
example. The estimates $b_0$ and $b_1$ are shown in the column labeled Coef, corresponding to
the lines Constant and X, respectively. The additional information shown in Figure 1.11
will be explained later.

Point Estimation of Mean Response 

Estimated Regression Function. Given sample estimators $b_0$ and $b_1$ of the parameters
in the regression function (1.3):

$$E\{Y\} = \beta_0 + \beta_1 X$$

we estimate the regression function as follows:

$$\hat{Y} = b_0 + b_1 X \qquad (1.12)$$

where $\hat{Y}$ (read Y hat) is the value of the estimated regression function at the level $X$ of the
predictor variable.

We call a value of the response variable a response and $E\{Y\}$ the mean response. Thus,
the mean response stands for the mean of the probability distribution of $Y$ corresponding
to the level $X$ of the predictor variable. $\hat{Y}$ then is a point estimator of the mean response
when the level of the predictor variable is $X$. It can be shown as an extension of the Gauss-Markov
theorem (1.11) that $\hat{Y}$ is an unbiased estimator of $E\{Y\}$, with minimum variance
in the class of unbiased linear estimators.

For the cases in the study, we will call $\hat{Y}_i$:

$$\hat{Y}_i = b_0 + b_1 X_i \qquad i = 1, \ldots, n \qquad (1.13)$$

the fitted value for the ith case. Thus, the fitted value $\hat{Y}_i$ is to be viewed in distinction to the
observed value $Y_i$.

For the Toluca Company example, we found that the least squares estimates of the regression 
coefficients are: 


$$b_0 = 62.37 \qquad b_1 = 3.5702$$

Hence, the estimated regression function is:

$$\hat{Y} = 62.37 + 3.5702X$$

This estimated regression function is plotted in Figure 1.10b. It appears to be a good 
description of the statistical relationship between lot size and work hours. 

To estimate the mean response for any level $X$ of the predictor variable, we simply
substitute that value of $X$ in the estimated regression function. Suppose that we are interested
in the mean number of work hours required when the lot size is $X = 65$; our point estimate is:

$$\hat{Y} = 62.37 + 3.5702(65) = 294.4$$

Thus, we estimate that the mean number of work hours required for production runs of 
X = 65 units is 294.4 hours. We interpret this to mean that if many lots of 65 units are 
produced under the conditions of the 25 runs on which the estimated regression function is 
based, the mean labor time for these lots is about 294 hours. Of course, the labor time for 
any one lot of size 65 is likely to fall above or below the mean response because of inherent 
variability in the production system, as represented by the error term in the model.
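Point estimation of the mean response amounts to evaluating the estimated regression function. The sketch below wraps that in a small function; the name `estimated_mean_hours` and the range check are illustrative additions. The check reflects the earlier remark that the estimates apply to lot sizes from about 20 to about 120, and the decision to raise an error outside that range is an assumption of this sketch, not a rule stated in the text.

```python
B0 = 62.37      # estimated intercept (hours), from the text
B1 = 3.5702     # estimated slope (hours per unit), from the text
LOT_SIZE_RANGE = (20, 120)  # range of lot sizes in the data


def estimated_mean_hours(lot_size):
    """Return the point estimate of mean work hours for a given lot size."""
    low, high = LOT_SIZE_RANGE
    if not low <= lot_size <= high:
        raise ValueError(
            f"lot size {lot_size} is outside the observed range {low} to {high}"
        )
    return B0 + B1 * lot_size


estimate_65 = estimated_mean_hours(65)
print(f"estimated mean work hours for lots of 65 units: {estimate_65:.1f}")
```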



TABLE 1.2 Fitted Values, Residuals, and Squared Residuals: Toluca Company Example.

           (1)         (2)          (3)               (4)                         (5)
                                    Estimated
           Lot         Work         Mean                                          Squared
  Run      Size        Hours        Response          Residual                    Residual
  i        $X_i$       $Y_i$        $\hat{Y}_i$       $Y_i - \hat{Y}_i = e_i$     $(Y_i - \hat{Y}_i)^2 = e_i^2$
  1        80          399          347.98             51.02                      2,603.0
  2        30          121          169.47            -48.47                      2,349.3
  3        50          221          240.88            -19.88                        395.2
  ...      ...         ...          ...               ...                         ...
  23       40          244          205.17             38.83                      1,507.8
  24       80          342          347.98             -5.98                         35.8
  25       70          323          312.28             10.72                        114.9
  Total    1,750       7,807        7,807               0                        54,825


Fitted values for the sample cases are obtained by substituting the appropriate $X$ values
into the estimated regression function. For the first sample case, we have $X_1 = 80$. Hence,
the fitted value for the first case is:

$$\hat{Y}_1 = 62.37 + 3.5702(80) = 347.98$$

This compares with the observed work hours of $Y_1 = 399$. Table 1.2 contains the observed
and fitted values for a portion of the Toluca Company data in columns 2 and 3, respectively.

Alternative Model (1.6). When the alternative regression model (1.6):

$$Y_i = \beta_0^* + \beta_1 (X_i - \bar{X}) + \varepsilon_i$$

is to be utilized, the least squares estimator $b_1$ of $\beta_1$ remains the same as before. The least
squares estimator of $\beta_0^* = \beta_0 + \beta_1 \bar{X}$ becomes, from (1.10b):

$$b_0^* = b_0 + b_1 \bar{X} = (\bar{Y} - b_1 \bar{X}) + b_1 \bar{X} = \bar{Y} \qquad (1.14)$$

Hence, the estimated regression function for alternative model (1.6) is:

$$\hat{Y} = \bar{Y} + b_1 (X - \bar{X}) \qquad (1.15)$$

In the Toluca Company example, $\bar{Y} = 312.28$ and $\bar{X} = 70.0$ (Table 1.1). Hence, the
estimated regression function in alternative form is:

$$\hat{Y} = 312.28 + 3.5702(X - 70.0)$$

For the first lot in our example, $X_1 = 80$; hence, we estimate the mean response to be:

$$\hat{Y}_1 = 312.28 + 3.5702(80 - 70.0) = 347.98$$

which, of course, is identical to our earlier result.
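The equivalence of the two forms of the estimated regression function can be checked numerically. In this sketch the slope is computed at full precision from the Table 1.1 totals; the function names `fitted_original` and `fitted_centered` are hypothetical labels for forms (1.12) and (1.15).

```python
X_BAR = 70.0
Y_BAR = 312.28
B1 = 70_690 / 19_800        # slope from the Table 1.1 totals
B0 = Y_BAR - B1 * X_BAR     # intercept, formula (1.10b)


def fitted_original(x):
    """Estimated regression function in the form (1.12)."""
    return B0 + B1 * x


def fitted_centered(x):
    """Estimated regression function in the alternative form (1.15)."""
    return Y_BAR + B1 * (x - X_BAR)


# Lot sizes of the runs shown in Tables 1.1 and 1.2.
lot_sizes = [30, 40, 50, 70, 80]
largest_gap = max(abs(fitted_original(x) - fitted_centered(x)) for x in lot_sizes)

for x in lot_sizes:
    print(f"X = {x:3d}: fitted value = {fitted_centered(x):.2f}")
```

The centered form makes it easy to see that the fitted value at $X = \bar{X}$ is $\bar{Y}$, since the slope term vanishes there.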

Residuals 


The ith residual is the difference between the observed value $Y_i$ and the corresponding fitted
value $\hat{Y}_i$. This residual is denoted by $e_i$ and is defined in general as follows:

$$e_i = Y_i - \hat{Y}_i \qquad (1.16)$$





For regression model (1.1), the residual $e_i$ becomes:

$$e_i = Y_i - (b_0 + b_1 X_i) = Y_i - b_0 - b_1 X_i \qquad (1.16a)$$


The calculation of the residuals for the Toluca Company example is shown for a portion 
of the data in Table 1.2. We see that the residual for the first case is: 

$$e_1 = Y_1 - \hat{Y}_1 = 399 - 347.98 = 51.02$$

The residuals for the first two cases are illustrated graphically in Figure 1.12. Note in
this figure that the magnitude of a residual is represented by the vertical deviation of the $Y_i$
observation from the corresponding point on the estimated regression function (i.e., from
the corresponding fitted value $\hat{Y}_i$).

We need to distinguish between the model error term value $\varepsilon_i = Y_i - E\{Y_i\}$ and the
residual $e_i = Y_i - \hat{Y}_i$. The former involves the vertical deviation of $Y_i$ from the unknown
true regression line and hence is unknown. On the other hand, the residual is the vertical
deviation of $Y_i$ from the fitted value $\hat{Y}_i$ on the estimated regression line, and it is known.

Residuals are highly useful for studying whether a given regression model is appropriate 
for the data at hand. We discuss this use in Chapter 3. 
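The residual calculation can be sketched for the six production runs that Table 1.2 lists. The dictionary `RUNS` and the helper `fitted` are illustrative names; the slope is carried at full precision from the Table 1.1 totals so that the rounded residuals match the table.

```python
X_BAR, Y_BAR = 70.0, 312.28
B1 = 70_690 / 19_800   # slope from the Table 1.1 totals

# Runs shown in Table 1.2: run number -> (lot size, observed work hours).
RUNS = {1: (80, 399), 2: (30, 121), 3: (50, 221),
        23: (40, 244), 24: (80, 342), 25: (70, 323)}


def fitted(x):
    """Fitted value from the estimated regression function."""
    return Y_BAR + B1 * (x - X_BAR)


# Residual e_i = observed value minus fitted value, as in (1.16).
residuals = {run: y - fitted(x) for run, (x, y) in RUNS.items()}

for run, e in residuals.items():
    print(f"run {run:2d}: residual = {e:8.2f}")
```

Only six of the 25 runs are listed, so the residuals printed here need not sum to zero; the zero-sum property applies to the full set of cases.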

Properties of Fitted Regression Line 

The estimated regression line (1.12) fitted by the method of least squares has a number of 
properties worth noting. These properties of the least squares estimated regression function 
do not apply to all regression models, as we shall see in Chapter 4. 


1. The sum of the residuals is zero:

$$\sum_{i=1}^{n} e_i = 0 \qquad (1.17)$$

Table 1.2, column 4, illustrates this property for the Toluca Company example. Rounding
errors may, of course, be present in any particular case, resulting in a sum of the residuals
that does not equal zero exactly.

2. The sum of the squared residuals, $\sum e_i^2$, is a minimum. This was the requirement to
be satisfied in deriving the least squares estimators of the regression parameters since the
criterion $Q$ in (1.8) to be minimized equals $\sum e_i^2$ when the least squares estimators $b_0$ and
$b_1$ are used for estimating $\beta_0$ and $\beta_1$.

FIGURE 1.12 Illustration of Residuals: Toluca Company Example (not drawn to scale).
[Horizontal axis: Lot Size $X$. Vertical axis: Hours $Y$.]

3. The sum of the observed values $Y_i$ equals the sum of the fitted values $\hat{Y}_i$:

$$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i \qquad (1.18)$$

This property is illustrated in Table 1.2, columns 2 and 3, for the Toluca Company example.
It follows that the mean of the fitted values $\hat{Y}_i$ is the same as the mean of the observed
values $Y_i$, namely, $\bar{Y}$.

4. The sum of the weighted residuals is zero when the residual in the ith trial is weighted
by the level of the predictor variable in the ith trial:

$$\sum_{i=1}^{n} X_i e_i = 0 \qquad (1.19)$$

5. A consequence of properties (1.17) and (1.19) is that the sum of the weighted residuals
is zero when the residual in the ith trial is weighted by the fitted value of the response variable
for the ith trial:

$$\sum_{i=1}^{n} \hat{Y}_i e_i = 0 \qquad (1.20)$$

6. The regression line always goes through the point $(\bar{X}, \bar{Y})$.
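The step from properties (1.17) and (1.19) to property (1.20) is short enough to write out. Substituting the fitted value from (1.13) gives:

$$\sum_{i=1}^{n} \hat{Y}_i e_i = \sum_{i=1}^{n} (b_0 + b_1 X_i) e_i = b_0 \sum_{i=1}^{n} e_i + b_1 \sum_{i=1}^{n} X_i e_i = b_0 (0) + b_1 (0) = 0$$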

Comment 

The six properties of the fitted regression line follow directly from the least squares normal equations (1.9). For example, property 1 in (1.17) is proven as follows:

$$\sum e_i = \sum (Y_i - b_0 - b_1 X_i) = \sum Y_i - n b_0 - b_1 \sum X_i = 0 \quad \text{by the first normal equation (1.9a)}$$

Property 6, that the regression line always goes through the point $(\bar{X}, \bar{Y})$, can be demonstrated
easily from the alternative form (1.15) of the estimated regression line. When $X = \bar{X}$, we have:

$$\hat{Y} = \bar{Y} + b_1 (X - \bar{X}) = \bar{Y} + b_1 (\bar{X} - \bar{X}) = \bar{Y}$$ ■
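These properties can be verified numerically on a complete data set. Because only part of the Toluca Company data is listed, the sketch below uses the three persistence study cases quoted earlier; the dictionary `checks` and its labels are illustrative.

```python
# Persistence study data quoted earlier in the text.
xs = [20, 55, 30]
ys = [5, 12, 10]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Each quantity below should be zero, up to floating-point rounding.
checks = {
    "(1.17) sum of residuals": sum(residuals),
    "(1.18) sum of Y minus sum of fitted values": sum(ys) - sum(fitted),
    "(1.19) sum of X_i * e_i": sum(x * e for x, e in zip(xs, residuals)),
    "(1.20) sum of fitted_i * e_i": sum(f * e for f, e in zip(fitted, residuals)),
    "property 6: fitted value at X_bar minus Y_bar": (b0 + b1 * x_bar) - y_bar,
}

for label, value in checks.items():
    print(f"{label}: {value:.2e}")
```

The printed values are not exactly zero on most machines, which illustrates the remark above about rounding errors.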

1.7 Estimation of Error Terms Variance $\sigma^2$


The variance $\sigma^2$ of the error terms $\varepsilon_i$ in regression model (1.1) needs to be estimated to
obtain an indication of the variability of the probability distributions of $Y$. In addition, as
we shall see in the next chapter, a variety of inferences concerning the regression function
and the prediction of $Y$ require an estimate of $\sigma^2$.

Point Estimator of $\sigma^2$

To lay the basis for developing an estimator of $\sigma^2$ for regression model (1.1), we first
consider the simpler problem of sampling from a single population.

Single Population. We know that the variance $\sigma^2$ of a single population is estimated by
the sample variance $s^2$. In obtaining the sample variance $s^2$, we consider the deviation of
an observation $Y_i$ from the estimated mean $\bar{Y}$, square it, and then sum all such squared
deviations:

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

Such a sum is called a sum of squares. The sum of squares is then divided by the degrees
of freedom associated with it. This number is $n - 1$ here, because one degree of freedom is
lost by using $\bar{Y}$ as an estimate of the unknown population mean $\mu$. The resulting estimator
is the usual sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$

which is an unbiased estimator of the variance $\sigma^2$ of an infinite population. The sample
variance is often called a mean square, because a sum of squares has been divided by the
appropriate number of degrees of freedom.


Regression Model. The logic of developing an estimator of $\sigma^2$ for the regression model is
the same as for sampling from a single population. Recall in this connection from (1.4) that
the variance of each observation $Y_i$ for regression model (1.1) is $\sigma^2$, the same as that of each
error term $\varepsilon_i$. We again need to calculate a sum of squared deviations, but must recognize
that the $Y_i$ now come from different probability distributions with different means that
depend upon the level $X_i$. Thus, the deviation of an observation $Y_i$ must be calculated
around its own estimated mean $\hat{Y}_i$. Hence, the deviations are the residuals:

$$Y_i - \hat{Y}_i = e_i$$

and the appropriate sum of squares, denoted by SSE, is:

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2 \qquad (1.21)$$

where SSE stands for error sum of squares or residual sum of squares.

The sum of squares SSE has $n - 2$ degrees of freedom associated with it. Two degrees
of freedom are lost because both $\beta_0$ and $\beta_1$ had to be estimated in obtaining the estimated
means $\hat{Y}_i$. Hence, the appropriate mean square, denoted by MSE or $s^2$, is:

$$s^2 = MSE = \frac{SSE}{n - 2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} = \frac{\sum e_i^2}{n - 2} \qquad (1.22)$$

where MSE stands for error mean square or residual mean square.

It can be shown that MSE is an unbiased estimator of $\sigma^2$ for regression model (1.1):

$$E\{MSE\} = \sigma^2 \qquad (1.23)$$

An estimator of the standard deviation $\sigma$ is simply $s = \sqrt{MSE}$, the positive square root of
MSE.


Example

We will calculate SSE for the Toluca Company example by (1.21). The residuals were
obtained earlier in Table 1.2, column 4. This table also shows the squared residuals in
column 5. From these results, we obtain:

$$SSE = 54,825$$

Since $25 - 2 = 23$ degrees of freedom are associated with SSE, we find:

$$s^2 = MSE = \frac{54,825}{23} = 2,384$$

Finally, a point estimate of $\sigma$, the standard deviation of the probability distribution of $Y$ for
any $X$, is $s = \sqrt{2,384} = 48.8$ hours.

Consider again the case where the lot size is $X = 65$ units. We found earlier that the
mean of the probability distribution of $Y$ for this lot size is estimated to be 294.4 hours.
Now, we have the additional information that the standard deviation of this distribution is
estimated to be 48.8 hours. This estimate is shown in the MINITAB output in Figure 1.11,
labeled as s. We see that the variation in work hours from lot to lot for lots of 65 units is
quite substantial (49 hours) compared to the mean of the distribution (294 hours).
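The calculation of MSE and $s$ from SSE is a two-line computation. The sketch below uses the SSE total from Table 1.2; the variable names are illustrative.

```python
import math

sse = 54_825.0   # error sum of squares, total of column 5 in Table 1.2
n = 25           # number of production runs

degrees_of_freedom = n - 2       # two parameters were estimated
mse = sse / degrees_of_freedom   # error mean square, formula (1.22)
s = math.sqrt(mse)               # point estimate of the standard deviation

print(f"MSE = {mse:.0f}, s = {s:.1f} hours on {degrees_of_freedom} degrees of freedom")
```

Carried to two decimals, the value of $s$ agrees with the entry labeled s in the MINITAB output of Figure 1.11.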


1.8 Normal Error Regression Model

No matter what may be the form of the distribution of the error terms $\varepsilon_i$ (and hence of the
$Y_i$), the least squares method provides unbiased point estimators of $\beta_0$ and $\beta_1$ that have
minimum variance among all unbiased linear estimators. To set up interval estimates and
make tests, however, we need to make an assumption about the form of the distribution of
the $\varepsilon_i$. The standard assumption is that the error terms $\varepsilon_i$ are normally distributed, and we
will adopt it here. A normal error term greatly simplifies the theory of regression analysis
and, as we shall explain shortly, is justifiable in many real-world situations where regression
analysis is applied.

Model

The normal error regression model is as follows:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (1.24)$$

where:

$Y_i$ is the observed response in the ith trial
$X_i$ is a known constant, the level of the predictor variable in the ith trial
$\beta_0$ and $\beta_1$ are parameters
$\varepsilon_i$ are independent $N(0, \sigma^2)$
$i = 1, \ldots, n$


Comments 

1. The symbol $N(0, \sigma^2)$ stands for normally distributed, with mean 0 and variance $\sigma^2$.

2. The normal error model (1.24) is the same as regression model (1.1) with unspecified error
distribution, except that model (1.24) assumes that the errors $\varepsilon_i$ are normally distributed.

3. Because regression model (1.24) assumes that the errors are normally distributed, the assumption
of uncorrelatedness of the $\varepsilon_i$ in regression model (1.1) becomes one of independence in the
normal error model. Hence, the outcome in any one trial has no effect on the error term for any other
trial, as to whether it is positive or negative, small or large.

4. Regression model (1.24) implies that the $Y_i$ are independent normal random variables, with
mean $E\{Y_i\} = \beta_0 + \beta_1 X_i$ and variance $\sigma^2$. Figure 1.6 pictures this normal error model. Each of the
probability distributions of $Y$ in Figure 1.6 is normally distributed, with constant variability, and the
regression function is linear.

5. The normality assumption for the error terms is justifiable in many situations because the error 
terms frequently represent the effects of factors omitted from the model that affect the response to 
some extent and that vary at random without reference to the variable X. For instance, in the Toluca 
Company example, the effects of such factors as time lapse since the last production run, particular 
machines used, season of the year, and personnel employed could vary more or less at random from 
run to run, independent of lot size. Also, there might be random measurement errors in the recording 
of Y, the hours required. Insofar as these random effects have a degree of mutual independence, the 
composite error term representing all these factors would tend to comply with the central limit
theorem and the error term distribution would approach normality as the number of factor effects
becomes large. 

A second reason why the normality assumption of the error terms is frequently justifiable is that 
the estimation and testing procedures to be discussed in the next chapter are based on the t distribution 
and are usually only sensitive to large departures from normality. Thus, unless the departures from 
normality are serious, particularly with respect to skewness, the actual confidence coefficients and 
risks of errors will be close to the levels for exact normality. ■ 

Estimation of Parameters by Method of Maximum Likelihood 

When the functional form of the probability distribution of the error terms is specified, 
estimators of the parameters $\beta_0$, $\beta_1$, and $\sigma^2$ can be obtained by the method of maximum
likelihood. Essentially, the method of maximum likelihood chooses as estimates those values 
of the parameters that are most consistent with the sample data. We explain the method of 
maximum likelihood first for the simple case when a single population with one parameter 
is sampled. Then we explain this method for regression models. 

Single Population. Consider a normal population whose standard deviation is known
to be $\sigma = 10$ and whose mean is unknown. A random sample of $n = 3$ observations is
selected from the population and yields the results $Y_1 = 250$, $Y_2 = 265$, $Y_3 = 259$. We
now wish to ascertain which value of $\mu$ is most consistent with the sample data. Consider
$\mu = 230$. Figure 1.13a shows the normal distribution with $\mu = 230$ and $\sigma = 10$; also shown
there are the locations of the three sample observations. Note that the sample observations
would be in the right tail of the distribution if $\mu$ were equal to 230. Since these are unlikely
occurrences, $\mu = 230$ is not consistent with the sample data.

Figure 1.13b shows the population and the locations of the sample data if $\mu$ were equal
to 259. Now the observations would be in the center of the distribution and much more
likely. Hence, $\mu = 259$ is more consistent with the sample data than $\mu = 230$.

FIGURE 1.13 Densities for Sample Observations for Two Possible Values of $\mu$: $Y_1 = 250$, $Y_2 = 265$, $Y_3 = 259$.
(a) $\mu = 230$
(b) $\mu = 259$

The method of maximum likelihood uses the density of the probability distribution at
$Y_i$ (i.e., the height of the curve at $Y_i$) as a measure of consistency for the observation $Y_i$.
Consider observation $Y_1$ in our example. If $Y_1$ is in the tail, as in Figure 1.13a, the height of
the curve will be small. If $Y_1$ is nearer to the center of the distribution, as in Figure 1.13b,
the height will be larger. Using the density function for a normal probability distribution
in (A.34) in Appendix A, we find the densities for $Y_1$, denoted by $f_1$, for the two cases of
$\mu$ in Figure 1.13 as follows:

$$\mu = 230: \quad f_1 = \frac{1}{\sqrt{2\pi}(10)} \exp\left[-\frac{1}{2}\left(\frac{250 - 230}{10}\right)^2\right] = .005399$$

$$\mu = 259: \quad f_1 = \frac{1}{\sqrt{2\pi}(10)} \exp\left[-\frac{1}{2}\left(\frac{250 - 259}{10}\right)^2\right] = .026609$$

The densities for all three sample observations for the two cases of $\mu$ are as follows:

           $\mu = 230$    $\mu = 259$
  $f_1$    .005399        .026609
  $f_2$    .000087        .033322
  $f_3$    .000595        .039894


The method of maximum likelihood uses the product of the densities (i.e., here, the
product of the three heights) as the measure of consistency of the parameter value with
the sample data. The product is called the likelihood value of the parameter value $\mu$ and
is denoted by $L(\mu)$. If the value of $\mu$ is consistent with the sample data, the densities will
be relatively large and so will be the product (i.e., the likelihood value). If the value of $\mu$
is not consistent with the data, the densities will be small and the product $L(\mu)$ will be
small.

For our simple example, the likelihood values are as follows for the two cases of $\mu$:

$$L(\mu = 230) = .005399(.000087)(.000595) = .279 \times 10^{-9}$$

$$L(\mu = 259) = .026609(.033322)(.039894) = .0000354$$

Since the likelihood value $L(\mu = 230)$ is a very small number, it is shown in scientific
notation, which indicates that there are nine zeros after the decimal place before 279. Note
that $L(\mu = 230)$ is much smaller than $L(\mu = 259)$, indicating that $\mu = 259$ is much more
consistent with the sample data than $\mu = 230$.
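The likelihood values can be recomputed with a short script. This is a minimal sketch using the standard normal density formula; the function names `normal_density` and `likelihood` are illustrative.

```python
import math

SAMPLE = [250, 265, 259]   # the three sample observations
SIGMA = 10.0               # known population standard deviation


def normal_density(y, mu, sigma):
    """Height of the normal curve with mean mu and std. dev. sigma at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def likelihood(mu, sample, sigma):
    """Product of the densities of all observations for a candidate mean mu."""
    product = 1.0
    for y in sample:
        product *= normal_density(y, mu, sigma)
    return product


l_230 = likelihood(230.0, SAMPLE, SIGMA)
l_259 = likelihood(259.0, SAMPLE, SIGMA)

print(f"L(mu = 230) = {l_230:.3e}")
print(f"L(mu = 259) = {l_259:.3e}")
```

The script multiplies unrounded densities, so its value for $\mu = 230$ differs slightly in the third significant digit from the product of the rounded densities shown above.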

The method of maximum likelihood chooses as the maximum likelihood estimate that
value of $\mu$ for which the likelihood value is largest. Just as for the method of least squares,
there are two methods of finding maximum likelihood estimates: by a systematic numerical
search and by use of an analytical solution. For some problems, analytical solutions for the
maximum likelihood estimators are available. For others, a computerized numerical search
must be conducted.

For our example, an analytical solution is available. It can be shown that for a normal
population the maximum likelihood estimator of $\mu$ is the sample mean $\bar{Y}$. In our example,
$\bar{Y} = 258$ and the maximum likelihood estimate of $\mu$ therefore is 258. The likelihood value
of $\mu = 258$ is $L(\mu = 258) = .0000359$, which is slightly larger than the likelihood value
of .0000354 for $\mu = 259$ that we had calculated earlier.
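Both routes to the maximum likelihood estimate can be illustrated for this example: a numerical search over candidate values of the mean, and the analytical solution given by the sample mean. The grid of candidate values (230.0 to 280.0 in steps of 0.1) is an arbitrary choice made for this sketch.

```python
import math

SAMPLE = [250, 265, 259]
SIGMA = 10.0


def likelihood(mu, sample, sigma):
    """Product of the normal densities of the observations for mean mu."""
    product = 1.0
    for y in sample:
        z = (y - mu) / sigma
        product *= math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)
    return product


# Numerical search: evaluate L(mu) on a grid of candidate means.
candidates = [tenths / 10 for tenths in range(2300, 2801)]
mu_search = max(candidates, key=lambda mu: likelihood(mu, SAMPLE, SIGMA))

# Analytical solution: the sample mean.
mu_analytical = sum(SAMPLE) / len(SAMPLE)

l_max = likelihood(mu_analytical, SAMPLE, SIGMA)
print(f"search: {mu_search}, analytical: {mu_analytical}, L = {l_max:.3e}")
```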

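The product of the densities viewed as a function of the unknown parameters is called
the likelihood function. For our example, where $\sigma = 10$, the likelihood function is:

$$L(\mu) = \left\{\frac{1}{\sqrt{2\pi}(10)} \exp\left[-\frac{1}{2}\left(\frac{250 - \mu}{10}\right)^2\right]\right\} \times \left\{\frac{1}{\sqrt{2\pi}(10)} \exp\left[-\frac{1}{2}\left(\frac{265 - \mu}{10}\right)^2\right]\right\} \times \left\{\frac{1}{\sqrt{2\pi}(10)} \exp\left[-\frac{1}{2}\left(\frac{259 - \mu}{10}\right)^2\right]\right\}$$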


Figure 1.14 shows a computer plot of the likelihood function for our example. It is based
on the calculation of likelihood values $L(\mu)$ for many values of $\mu$. Note that the likelihood
values at $\mu = 230$ and $\mu = 259$ correspond to the ones we determined earlier. Also note
that the likelihood function reaches a maximum at $\mu = 258$.

The fact that the likelihood function in Figure 1.14 is relatively peaked in the neighborhood
of the maximum likelihood estimate $\bar{Y} = 258$ is of particular interest. Note, for
instance, that for $\mu = 250$ or $\mu = 266$, the likelihood value is already only a little more
than one-third as large as the likelihood value at $\mu = 258$ (the ratio is $\exp(-.96) \approx .38$). This indicates that the maximum
likelihood estimate here is relatively precise because values of $\mu$ not near the maximum
likelihood estimate $\bar{Y} = 258$ are much less consistent with the sample data. When the
likelihood function is relatively flat in a fairly wide region around the maximum likelihood
estimate, many values of the parameter are almost as consistent with the sample data as the
maximum likelihood estimate, and the maximum likelihood estimate would therefore be
relatively imprecise.

FIGURE 1.14 Likelihood Function for Estimation of Mean of Normal Population: $Y_1 = 250$, $Y_2 = 265$, $Y_3 = 259$.
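How sharply the likelihood function falls away from its maximum can be worked out in closed form for this example. The following is a standard identity supplied for illustration, not a formula from the text. Because $\sum (Y_i - \mu)^2 = \sum (Y_i - \bar{Y})^2 + n(\bar{Y} - \mu)^2$, the likelihood relative to its maximum is:

$$\frac{L(\mu)}{L(\bar{Y})} = \exp\left[-\frac{n(\mu - \bar{Y})^2}{2\sigma^2}\right]$$

With $n = 3$, $\sigma = 10$, and $\bar{Y} = 258$, the values $\mu = 250$ and $\mu = 266$ both give:

$$\frac{L(\mu)}{L(258)} = \exp\left[-\frac{3(8)^2}{2(10)^2}\right] = \exp(-.96) \approx .38$$

so the likelihood at either value is roughly 38 percent of its maximum. The same expression shows that a larger sample size or a smaller $\sigma$ makes the likelihood function more peaked.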

Regression Model. The concepts just presented for maximum likelihood estimation of
a population mean carry over directly to the estimation of the parameters of normal error
regression model (1.24). For this model, each $Y_i$ observation is normally distributed with
mean $\beta_0 + \beta_1 X_i$ and standard deviation $\sigma$. To illustrate the method of maximum likelihood
estimation here, consider the earlier persistence study example on page 15. For simplicity,
let us suppose that we know $\sigma = 2.5$. We wish to determine the likelihood value for the
parameter values $\beta_0 = 0$ and $\beta_1 = .5$. For subject 1, $X_1 = 20$ and hence the mean of the
probability distribution would be $\beta_0 + \beta_1 X_1 = 0 + .5(20) = 10.0$. Figure 1.15a shows
the normal distribution with mean 10.0 and standard deviation 2.5. Note that the observed
value $Y_1 = 5$ is in the left tail of the distribution and that the density there is relatively small.
For the second subject, $X_2 = 55$ and hence $\beta_0 + \beta_1 X_2 = 27.5$. The normal distribution with
mean 27.5 is shown in Figure 1.15b. Note that the observed value $Y_2 = 12$ is most unlikely
for this case and that the density there is extremely small. Finally, note that the observed
value $Y_3 = 10$ is also in the left tail of its distribution if $\beta_0 = 0$ and $\beta_1 = .5$, as shown in
Figure 1.15c, and that the density there is also relatively small.


FIGURE 1.15 Densities for Sample Observations if $\beta_0 = 0$ and $\beta_1 = .5$: Persistence Study Example.
(a) $X_1 = 20$, $Y_1 = 5$, $\beta_0 + \beta_1 X_1 = .5(20) = 10$
(b) $X_2 = 55$, $Y_2 = 12$, $\beta_0 + \beta_1 X_2 = .5(55) = 27.5$
(c) $X_3 = 30$, $Y_3 = 10$, $\beta_0 + \beta_1 X_3 = .5(30) = 15$
(d) Combined Presentation


Figure 1.15d combines all of this information, showing the regression function $E\{Y\} = 0 + .5X$,
the three sample cases, and the three normal distributions. Note how poorly the
regression line fits the three sample cases, as was also indicated by the three small density
values. Thus, it appears that $\beta_0 = 0$ and $\beta_1 = .5$ are not consistent with the data.

We calculate the densities (i.e., heights of the curve) in the usual way. For $Y_1 = 5$,
$X_1 = 20$, the normal density is as follows when $\beta_0 = 0$ and $\beta_1 = .5$:

$$f_1 = \frac{1}{\sqrt{2\pi}(2.5)} \exp\left[-\frac{1}{2}\left(\frac{5 - 10.0}{2.5}\right)^2\right] = .021596$$

The other densities are $f_2 = .7175 \times 10^{-9}$ and $f_3 = .021596$, and the likelihood value of
$\beta_0 = 0$ and $\beta_1 = .5$ therefore is:

$$L(\beta_0 = 0, \beta_1 = .5) = .021596(.7175 \times 10^{-9})(.021596) = .3346 \times 10^{-12}$$
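The same computation can be scripted for any candidate pair of parameter values. The sketch below evaluates the likelihood for $\beta_0 = 0$, $\beta_1 = .5$ and, for contrast, for the least squares estimates 2.81 and .177 from Figure 1.9b, keeping the assumed value $\sigma = 2.5$. The function name `regression_likelihood` is illustrative.

```python
import math

DATA = [(20, 5), (55, 12), (30, 10)]   # persistence study cases (X, Y)
SIGMA = 2.5                            # standard deviation assumed known in the text


def regression_likelihood(beta0, beta1, data, sigma):
    """Product of normal densities of each Y around its mean beta0 + beta1*X."""
    product = 1.0
    for x, y in data:
        z = (y - (beta0 + beta1 * x)) / sigma
        product *= math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)
    return product


l_poor = regression_likelihood(0.0, 0.5, DATA, SIGMA)
l_least_squares = regression_likelihood(2.81, 0.177, DATA, SIGMA)

print(f"L(0, .5)       = {l_poor:.4e}")
print(f"L(2.81, .177)  = {l_least_squares:.4e}")
```

The likelihood at the least squares estimates is larger by many orders of magnitude, which anticipates the result that the two methods give the same estimates of the regression coefficients under the normal error model.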

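In general, the density of an observation $Y_i$ for the normal error regression model (1.24)
is as follows, utilizing the fact that $E\{Y_i\} = \beta_0 + \beta_1 X_i$ and $\sigma^2\{Y_i\} = \sigma^2$:

$$f_i = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{Y_i - \beta_0 - \beta_1 X_i}{\sigma}\right)^2\right] \qquad (1.25)$$

The likelihood function for $n$ observations $Y_1, Y_2, \ldots, Y_n$ is the product of the individual
densities in (1.25). Since the variance $\sigma^2$ of the error terms is usually unknown, the likelihood
function is a function of three parameters, $\beta_0$, $\beta_1$, and $\sigma^2$:

$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(Y_i - \beta_0 - \beta_1 X_i)^2\right] = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2\right] \qquad (1.26)$$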


The values of $\beta_0$, $\beta_1$, and $\sigma^2$ that maximize this likelihood function are the maximum
likelihood estimators and are denoted by $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$, respectively. These estimators can
be found analytically, and they are as follows:

Parameter      Maximum Likelihood Estimator
$\beta_0$      $\hat{\beta}_0 = b_0$, same as (1.10b)
$\beta_1$      $\hat{\beta}_1 = b_1$, same as (1.10a)
$\sigma^2$     $\hat{\sigma}^2 = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{n}$                    (1.27)


Thus, the maximum likelihood estimators of $\beta_0$ and $\beta_1$ are the same estimators as those
provided by the method of least squares. The maximum likelihood estimator $\hat{\sigma}^2$ is biased,
and ordinarily the unbiased estimator MSE as given in (1.22) is used. Note that the unbiased
estimator MSE or $s^2$ differs but slightly from the maximum likelihood estimator $\hat{\sigma}^2$,
especially if $n$ is not small:

$$s^2 = MSE = \frac{n}{n-2}\,\hat{\sigma}^2 \tag{1.28}$$


Example 


For the persistence study example, we know now that the maximum likelihood estimates of
$\beta_0$ and $\beta_1$ are $b_0 = 2.81$ and $b_1 = .177$, the same as the least squares estimates in Figure 1.9b.
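
As a check on these values, the estimates can be recomputed directly from the three persistence study cases shown in Figure 1.15. This is a rough sketch under that assumption; the variable names are illustrative, and the comparison of $\hat{\sigma}^2$ with MSE simply applies (1.27) and (1.28) to these three cases.

```python
# Persistence study cases (X_i, Y_i), as listed for Figure 1.15.
X = [20, 55, 30]
Y = [5, 12, 10]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b1 = sxy / sxx            # (1.10a); also the maximum likelihood estimate of beta_1
b0 = y_bar - b1 * x_bar   # (1.10b); also the maximum likelihood estimate of beta_0

# Error sum of squares around the fitted line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))

sigma2_mle = sse / n      # (1.27): biased maximum likelihood estimator
mse = sse / (n - 2)       # (1.22): unbiased estimator

print(f"b0 = {b0:.2f}, b1 = {b1:.3f}")
print(f"sigma^2 (MLE) = {sigma2_mle:.4f}, MSE = {mse:.4f}")
```

With only three cases the two variance estimates differ by the factor $n/(n-2) = 3$; the gap shrinks quickly as $n$ grows.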


Comments 

1. Since the maximum likelihood estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are the same as the least squares estimators
$b_0$ and $b_1$, they have the properties of all least squares estimators:

a. They are unbiased.

b. They have minimum variance among all unbiased linear estimators.

In addition, the maximum likelihood estimators $b_0$ and $b_1$ for the normal error regression model
(1.24) have other desirable properties:

c. They are consistent, as defined in (A.52).

d. They are sufficient, as defined in (A.53).

e. They are minimum variance unbiased; that is, they have minimum variance in the class of all
unbiased estimators (linear or otherwise).

Thus, for the normal error model, the estimators $b_0$ and $b_1$ have many desirable properties.

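2. We find the values of $\beta_0$, $\beta_1$, and $\sigma^2$ that maximize the likelihood function $L$ in (1.26) by taking
partial derivatives of $L$ with respect to $\beta_0$, $\beta_1$, and $\sigma^2$, equating each of the partials to zero, and
solving the system of equations thus obtained. We can work with $\log_e L$, rather than $L$, because
both $L$ and $\log_e L$ are maximized for the same values of $\beta_0$, $\beta_1$, and $\sigma^2$:

$$\log_e L = -\frac{n}{2}\log_e 2\pi - \frac{n}{2}\log_e \sigma^2 - \frac{1}{2\sigma^2}\sum (Y_i - \beta_0 - \beta_1 X_i)^2 \tag{1.29}$$

Partial differentiation of the logarithm of the likelihood function is much easier; it yields:

$$\frac{\partial(\log_e L)}{\partial \beta_0} = \frac{1}{\sigma^2}\sum (Y_i - \beta_0 - \beta_1 X_i)$$

$$\frac{\partial(\log_e L)}{\partial \beta_1} = \frac{1}{\sigma^2}\sum X_i (Y_i - \beta_0 - \beta_1 X_i)$$

$$\frac{\partial(\log_e L)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum (Y_i - \beta_0 - \beta_1 X_i)^2$$

We now set these partial derivatives equal to zero, replacing $\beta_0$, $\beta_1$, and $\sigma^2$ by the estimators $\hat{\beta}_0$,
$\hat{\beta}_1$, and $\hat{\sigma}^2$. We obtain, after some simplification:

$$\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \tag{1.30a}$$

$$\sum X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \tag{1.30b}$$

$$\frac{\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2}{n} = \hat{\sigma}^2 \tag{1.30c}$$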





Formulas (1.30a) and (1.30b) are identical to the earlier least squares normal equations (1.9), and
formula (1.30c) is the biased estimator of $\sigma^2$ given earlier in (1.27). ■
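
A numerical check can make the stationarity conditions concrete. The sketch below is illustrative only: the function name `score` is an assumption of this example, and the data are the three persistence study cases from Figure 1.15. It evaluates the three partial derivatives of $\log_e L$ at the maximum likelihood estimates and confirms that each is zero up to rounding error.

```python
def score(beta0, beta1, sigma2, X, Y):
    """Partial derivatives of log_e L with respect to beta0, beta1, and sigma^2."""
    n = len(X)
    resid = [y - beta0 - beta1 * x for x, y in zip(X, Y)]
    d_beta0 = sum(resid) / sigma2
    d_beta1 = sum(x * e for x, e in zip(X, resid)) / sigma2
    d_sigma2 = -n / (2 * sigma2) + sum(e * e for e in resid) / (2 * sigma2 ** 2)
    return d_beta0, d_beta1, d_sigma2

# Persistence study cases (X_i, Y_i), as listed for Figure 1.15.
X = [20, 55, 30]
Y = [5, 12, 10]
n = len(X)

# Maximum likelihood estimates from (1.27).
x_bar = sum(X) / n
y_bar = sum(Y) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
sigma2_hat = sum((y - b0 - b1 * x) ** 2 for x, y in zip(X, Y)) / n

gradient = score(b0, b1, sigma2_hat, X, Y)
print(gradient)  # each component is zero up to floating-point rounding
```

Evaluating `score` at other parameter values, such as the trial values $\beta_0 = 0$, $\beta_1 = .5$, gives nonzero derivatives, which indicates those values do not maximize the likelihood.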




Problems

1.1. Refer to the sales volume example on page 3. Suppose that the number of units sold is measured
accurately, but clerical errors are frequently made in determining the dollar sales. Would the
relation between the number of units sold and dollar sales still be a functional one? Discuss.

1.2. The members of a health spa pay annual membership dues of $300 plus a charge of $2 for each 
visit to the spa. Let Y denote the dollar cost for the year for a member and X the number of 
visits by the member during the year. Express the relation between X and Y mathematically. 
Is it a functional relation or a statistical relation? 

1.3. Experience with a certain type of plastic indicates that a relation exists between the hardness
(measured in Brinell units) of items molded from the plastic (Y) and the elapsed time since
termination of the molding process (X). It is proposed to study this relation by means of regression
analysis. A participant in the discussion objects, pointing out that the hardening of the plastic
“is the result of a natural chemical process that doesn’t leave anything to chance, so the relation
must be mathematical and regression analysis is not appropriate.” Evaluate this objection.

1.4. In Table 1.1, the lot size X is the same in production runs 1 and 24 but the work hours Y differ. 
What feature of regression model (1.1) is illustrated by this? 

1.5. When asked to state the simple linear regression model, a student wrote it as follows:
$E\{Y_i\} = \beta_0 + \beta_1 X_i + \varepsilon_i$. Do you agree?

1.6. Consider the normal error regression model (1.24). Suppose that the parameter values are
$\beta_0 = 200$, $\beta_1 = 5.0$, and $\sigma = 4$.

a. Plot this normal error regression model in the fashion of Figure 1.6. Show the distributions
of Y for X = 10, 20, and 40.

b. Explain the meaning of the parameters $\beta_0$ and $\beta_1$. Assume that the scope of the model
includes X = 0.

1.7. In a simulation exercise, regression model (1.1) applies with $\beta_0 = 100$, $\beta_1 = 20$, and $\sigma^2 = 25$.
An observation on Y will be made for X = 5.

a. Can you state the exact probability that Y will fall between 195 and 205? Explain.

b. If the normal error regression model (1.24) is applicable, can you now state the exact
probability that Y will fall between 195 and 205? If so, state it.

1.8. In Figure 1.6, suppose another Y observation is obtained at X = 45. Would $E\{Y\}$ for this new
observation still be 104? Would the Y value for this new case again be 108?

1.9. A student in accounting enthusiastically declared: “Regression is a very powerful tool. We can 
isolate fixed and variable costs by fitting a linear regression model, even when we have no data 
for small lots.” Discuss. 





1.10. An analyst in a large corporation studied the relation between current annual salary (Y) and
age (X) for the 46 computer programmers presently employed in the company. The analyst
concluded that the relation is curvilinear, reaching a maximum at 47 years. Does this imply
that the salary for a programmer increases until age 47 and then decreases? Explain.

1.11. The regression function relating production output by an employee after taking a training
program (Y) to the production output before the training program (X) is $E\{Y\} = 20 + .95X$,
where X ranges from 40 to 100. An observer concludes that the training program does not raise
production output on the average because $\beta_1$ is not greater than 1.0. Comment.

1.12. In a study of the relationship for senior citizens between physical activity and frequency of 
colds, participants were asked to monitor their weekly time spent in exercise over a five-year 
period and the frequency of colds. The study demonstrated that a negative statistical relation 
exists between time spent in exercise and frequency of colds. The investigator concluded that 
increasing the time spent in exercise is an effective strategy for reducing the frequency of colds 
for senior citizens. 

a. Were the data obtained in the study observational or experimental data? 

b. Comment on the validity of the conclusions reached by the investigator. 

c. Identify two or three other explanatory variables that might affect both the time spent in 
exercise and the frequency of colds for senior citizens simultaneously. 

d. How might the study be changed so that a valid conclusion about causal relationship between 
amount of exercise and frequency of colds can be reached? 

1.13. Computer programmers employed by a software developer were asked to participate in a
month-long training seminar. During the seminar, each employee was asked to record the number of
hours spent in class preparation each week. After completing the seminar, the productivity level
of each participant was measured. A positive linear statistical relationship between participants’
productivity levels and time spent in class preparation was found. The seminar leader concluded
that increases in employee productivity are caused by increased class preparation time.

a. Were the data used by the seminar leader observational or experimental data? 

b. Comment on the validity of the conclusion reached by the seminar leader. 

c. Identify two or three alternative variables that might cause both the employee productivity 
scores and the employee class participation times to increase (decrease) simultaneously. 

d. How might the study be changed so that a valid conclusion about causal relationship between 
class preparation time and employee productivity can be reached? 

1.14. Refer to Problem 1.3. Four different elapsed times since termination of the molding process
(treatments) are to be studied to see how they affect the hardness of a plastic. Sixteen batches
(experimental units) are available for the study. Each treatment is to be assigned to four
experimental units selected at random. Use a table of random digits or a random number generator
to make an appropriate randomization of assignments.

1.15. The effects of five dose levels are to be studied in a completely randomized design, and 20
experimental units are available. Each dose level is to be assigned to four experimental units
selected at random. Use a table of random digits or a random number generator to make an
appropriate randomization of assignments.

1.16. Evaluate the following statement: “For the least squares method to be fully valid, it is required 
that the distribution of Y be normal.” 

1.17. A person states that $b_0$ and $b_1$ in the fitted regression function (1.13) can be estimated by the
method of least squares. Comment.

1.18. According to (1.17), $\sum e_i = 0$ when regression model (1.1) is fitted to a set of $n$ cases by the
method of least squares. Is it also true that $\sum \varepsilon_i = 0$? Comment.





1.19. Grade point average. The director of admissions of a small college selected 120 students at
random from the new freshman class in a study to determine whether a student’s grade point
average (GPA) at the end of the freshman year (Y) can be predicted from the ACT test score (X).
The results of the study follow. Assume that first-order regression model (1.1) is appropriate.

i:     1      2      3      ...   118    119    120
X_i:   21     14     28     ...   28     16     28
Y_i:   3.897  3.885  3.778  ...   3.914  1.860  2.948


a. Obtain the least squares estimates of $\beta_0$ and $\beta_1$, and state the estimated regression function.

b. Plot the estimated regression function and the data. Does the estimated regression function
appear to fit the data well?

c. Obtain a point estimate of the mean freshman GPA for students with ACT test score X = 30.

d. What is the point estimate of the change in the mean response when the entrance test score
increases by one point?

*1.20. Copier maintenance. The Tri-City Office Equipment Corporation sells an imported copier on
a franchise basis and performs preventive maintenance and repair service on this copier. The
data below have been collected from 45 recent calls on users to perform routine preventive
maintenance service; for each call, X is the number of copiers serviced and Y is the total
number of minutes spent by the service person. Assume that first-order regression model (1.1)
is appropriate.

i:     1   2   3   ...   43  44  45
X_i:   2   4   3   ...   2   4   5
Y_i:   20  60  46  ...   27  61  77


a. Obtain the estimated regression function. 

b. Plot the estimated regression function and the data. How well does the estimated regression 
function fit the data? 

c. Interpret $b_0$ in your estimated regression function. Does $b_0$ provide any relevant information
here? Explain.

d. Obtain a point estimate of the mean service time when X = 5 copiers are serviced. 

*1.21. Airfreight breakage. A substance used in biological and medical research is shipped by
airfreight to users in cartons of 1,000 ampules. The data below, involving 10 shipments, were
collected on the number of times the carton was transferred from one aircraft to another over
the shipment route (X) and the number of ampules found to be broken upon arrival (Y). Assume
that first-order regression model (1.1) is appropriate.

i:     1   2   3   4   5   6   7   8   9   10
X_i:   1   0   2   0   3   1   0   1   2   0
Y_i:   16  9   17  12  22  13  8   15  19  11

a. Obtain the estimated regression function. Plot the estimated regression function and the
data. Does a linear regression function appear to give a good fit here?

b. Obtain a point estimate of the expected number of broken ampules when X = 1 transfer is 
made. 





c. Estimate the increase in the expected number of ampules broken when there are 2 transfers
as compared to 1 transfer.

d. Verify that your fitted regression line goes through the point $(\bar{X}, \bar{Y})$.

1.22. Plastic hardness. Refer to Problems 1.3 and 1.14. Sixteen batches of the plastic were made, 
and from each batch one test item was molded. Each test item was randomly assigned to one of 
the four predetermined time levels, and the hardness was measured after the assigned elapsed 
time. The results are shown below; X is the elapsed time in hours, and Y is hardness in Brinell 
units. Assume that first-order regression model (1.1) is appropriate. 


i:     1    2    3    ...   14   15   16
X_i:   16   16   16   ...   40   40   40
Y_i:   199  205  196  ...   248  253  246


a. Obtain the estimated regression function. Plot the estimated regression function and the 
data. Does a linear regression function appear to give a good fit here? 

b. Obtain a point estimate of the mean hardness when X = 40 hours. 

c. Obtain a point estimate of the change in mean hardness when X increases by 1 hour.

1.23. Refer to Grade point average Problem 1.19. 

a. Obtain the residuals $e_i$. Do they sum to zero in accord with (1.17)?

b. Estimate $\sigma^2$ and $\sigma$. In what units is $\sigma$ expressed?

*1.24. Refer to Copier maintenance Problem 1.20.

a. Obtain the residuals $e_i$ and the sum of the squared residuals $\sum e_i^2$. What is the relation
between the sum of the squared residuals here and the quantity $Q$ in (1.8)?

b. Obtain point estimates of $\sigma^2$ and $\sigma$. In what units is $\sigma$ expressed?

*1.25. Refer to Airfreight breakage Problem 1.21.

a. Obtain the residual for the first case. What is its relation to $\varepsilon_1$?

b. Compute $\sum e_i^2$ and MSE. What is estimated by MSE?

1.26. Refer to Plastic hardness Problem 1.22.

a. Obtain the residuals $e_i$. Do they sum to zero in accord with (1.17)?

b. Estimate $\sigma^2$ and $\sigma$. In what units is $\sigma$ expressed?

*1.27. Muscle mass. A person’s muscle mass is expected to decrease with age. To explore this
relationship in women, a nutritionist randomly selected 15 women from each 10-year age group,
beginning with age 40 and ending with age 79. The results follow; X is age, and Y is a measure
of muscle mass. Assume that first-order regression model (1.1) is appropriate.

i:     1    2    3    ...   58   59   60
X_i:   43   41   47   ...   76   72   76
Y_i:   106  106  97   ...   56   70   74


a. Obtain the estimated regression function. Plot the estimated regression function and the data.
Does a linear regression function appear to give a good fit here? Does your plot support the
anticipation that muscle mass decreases with age?

b. Obtain the following: (1) a point estimate of the difference in the mean muscle mass for
women differing in age by one year, (2) a point estimate of the mean muscle mass for women
aged X = 60 years, (3) the value of the residual for the eighth case, (4) a point estimate of $\sigma^2$.





1.28. Crime rate. A criminologist studying the relationship between level of education and crime
rate in medium-sized U.S. counties collected the following data for a random sample of 84
counties; X is the percentage of individuals in the county having at least a high-school diploma, and
Y is the crime rate (crimes reported per 100,000 residents) last year. Assume that first-order
regression model (1.1) is appropriate.

i:     1      2      3      ...   82     83     84
X_i:   74     82     81     ...   88     83     76
Y_i:   8,487  8,179  8,362  ...   8,040  6,981  7,582

a. Obtain the estimated regression function. Plot the estimated regression function and the
data. Does the linear regression function appear to give a good fit here? Discuss.

b. Obtain point estimates of the following: (1) the difference in the mean crime rate for two
counties whose high-school graduation rates differ by one percentage point, (2) the mean
crime rate last year in counties with high school graduation percentage X = 80, (3) $\varepsilon_{10}$,
(4) $\sigma^2$.


Exercises


1.29. Refer to regression model (1.1). Assume that X = 0 is within the scope of the model. What is
the implication for the regression function if $\beta_0 = 0$ so that the model is $Y_i = \beta_1 X_i + \varepsilon_i$? How
would the regression function plot on a graph?

1.30. Refer to regression model (1.1). What is the implication for the regression function if $\beta_1 = 0$
so that the model is $Y_i = \beta_0 + \varepsilon_i$? How would the regression function plot on a graph?

1.31. Refer to Plastic hardness Problem 1.22. Suppose one test item was molded from a single
batch of plastic and the hardness of this one item was measured at 16 different points in time.
Would the error term in the regression model for this case still reflect the same effects as for
the experiment initially described? Would you expect the error terms for the different points in
time to be uncorrelated? Discuss.

1.32. Derive the expression for $b_1$ in (1.10a) from the normal equations in (1.9).

1.33. (Calculus needed.) Refer to the regression model $Y_i = \beta_0 + \varepsilon_i$ in Exercise 1.30. Derive the
least squares estimator of $\beta_0$ for this model.

1.34. Prove that the least squares estimator of $\beta_0$ obtained in Exercise 1.33 is unbiased.

1.35. Prove the result in (1.18): that the sum of the Y observations is the same as the sum of the
fitted values.

1.36. Prove the result in (1.20): that the sum of the residuals weighted by the fitted values is zero.

1.37. Refer to Table 1.1 for the Toluca Company example. When asked to present a point estimate
of the expected work hours for lot sizes of 30 pieces, a person gave the estimate 202 because
this is the mean number of work hours in the three runs of size 30 in the study. A critic states
that this person’s approach “throws away” most of the data in the study because cases with lot
sizes other than 30 are ignored. Comment.

1.38. In Airfreight breakage Problem 1.21, the least squares estimates are $b_0 = 10.20$ and $b_1 = 4.00$,
and $\sum e_i^2 = 17.60$. Evaluate the least squares criterion $Q$ in (1.8) for the estimates (1) $b_0 = 9$,
$b_1 = 3$; (2) $b_0 = 11$, $b_1 = 5$. Is the criterion $Q$ larger for these estimates than for the least squares
estimates?

1.39. Two observations on Y were obtained at each of three X levels, namely, at X = 5, X = 10, and
X = 15.

a. Show that the least squares regression line fitted to the three points $(5, \bar{Y}_1)$, $(10, \bar{Y}_2)$, and
$(15, \bar{Y}_3)$, where $\bar{Y}_1$, $\bar{Y}_2$, and $\bar{Y}_3$ denote the means of the Y observations at the three X levels,
is identical to the least squares regression line fitted to the original six cases.

b. In this study, could the error term variance $\sigma^2$ be estimated without fitting a regression line?
Explain.

1.40. In fitting regression model (1.1), it was found that observation $Y_i$ fell directly on the fitted
regression line (i.e., $Y_i = \hat{Y}_i$). If this case were deleted, would the least squares regression line
fitted to the remaining $n - 1$ cases be changed? [Hint: What is the contribution of case $i$ to the
least squares criterion $Q$ in (1.8)?]

1.41. (Calculus needed.) Refer to the regression model $Y_i = \beta_1 X_i + \varepsilon_i$, $i = 1, \ldots, n$, in Exercise 1.29.

a. Find the least squares estimator of $\beta_1$.

b. Assume that the error terms $\varepsilon_i$ are independent $N(0, \sigma^2)$ and that $\sigma^2$ is known. State the
likelihood function for the $n$ sample observations on Y and obtain the maximum likelihood
estimator of $\beta_1$. Is it the same as the least squares estimator?

c. Show that the maximum likelihood estimator of $\beta_1$ is unbiased.

1.42. Typographical errors. Shown below are the number of galleys for a manuscript (X) and the
dollar cost of correcting typographical errors (Y) in a random sample of recent orders handled by
a firm specializing in technical manuscripts. Assume that the regression model $Y_i = \beta_1 X_i + \varepsilon_i$
is appropriate, with normally distributed independent error terms whose variance is $\sigma^2 = 16$.

i:     1    2    3   4    5    6
X_i:   7    12   4   14   25   30
Y_i:   128  213  75  250  446  540

a. State the likelihood function for the six Y observations, for $\sigma^2 = 16$.

b. Evaluate the likelihood function for $\beta_1 = 17$, 18, and 19. For which of these $\beta_1$ values is
the likelihood function largest?

c. The maximum likelihood estimator is $b_1 = \sum X_i Y_i / \sum X_i^2$. Find the maximum likelihood
estimate. Are your results in part (b) consistent with this estimate?

d. Using a computer graphics or statistics package, evaluate the likelihood function for values
of $\beta_1$ between $\beta_1 = 17$ and $\beta_1 = 19$ and plot the function. Does the point at which the
likelihood function is maximized correspond to the maximum likelihood estimate found in
part (c)?


Projects

1.43. Refer to the CDI data set in Appendix C.2. The number of active physicians in a CDI (Y) is
expected to be related to total population, number of hospital beds, and total personal income.
Assume that first-order regression model (1.1) is appropriate for each of the three predictor
variables.

a. Regress the number of active physicians in turn on each of the three predictor variables.
State the estimated regression functions.

b. Plot the three estimated regression functions and data on separate graphs. Does a linear
regression relation appear to provide a good fit for each of the three predictor variables?

c. Calculate MSE for each of the three predictor variables. Which predictor variable leads to
the smallest variability around the fitted regression line?

1.44. Refer to the CDI data set in Appendix C.2.

a. For each geographic region, regress per capita income in a CDI (Y) against the
percentage of individuals in a county having at least a bachelor’s degree (X). Assume that
first-order regression model (1.1) is appropriate for each region. State the estimated
regression functions.

b. Are the estimated regression functions similar for the four regions? Discuss. 

c. Calculate MSE for each region. Is the variability around the fitted regression line approxi¬ 
mately the same for the four regions? Discuss. 

1.45. Refer to the SENIC data set in Appendix C.1. The average length of stay in a hospital (Y) is
anticipated to be related to infection risk, available facilities and services, and routine chest
X-ray ratio. Assume that first-order regression model (1.1) is appropriate for each of the three
predictor variables.

a. Regress average length of stay on each of the three predictor variables. State the estimated 
regression functions. 

b. Plot the three estimated regression functions and data on separate graphs. Does a linear 
relation appear to provide a good fit for each of the three predictor variables? 

c. Calculate MSE for each of the three predictor variables. Which predictor variable leads to 
the smallest variability around the fitted regression line? 


1.46. Refer to the SENIC data set in Appendix C.1.

a. For each geographic region, regress average length of stay in hospital (Y) against infection
risk (X). Assume that first-order regression model (1.1) is appropriate for each region. State
the estimated regression functions.

b. Are the estimated regression functions similar for the four regions? Discuss.


c. Calculate MSE for each region. Is the variability around the fitted regression line approxi¬ 
mately the same for the four regions? Discuss. 

1.47. Refer to Typographical errors Problem 1.42. Assume that first-order regression model (1.1)
is appropriate, with normally distributed independent error terms whose variance is $\sigma^2 = 16$.

a. State the likelihood function for the six observations, for $\sigma^2 = 16$.

b. Obtain the maximum likelihood estimates of $\beta_0$ and $\beta_1$, using (1.27).

c. Using a computer graphics or statistics package, obtain a three-dimensional plot of the
likelihood function for values of $\beta_0$ between $\beta_0 = -10$ and $\beta_0 = 10$ and for values of $\beta_1$
between $\beta_1 = 17$ and $\beta_1 = 19$. Does the likelihood appear to be maximized by the
maximum likelihood estimates found in part (b)?



Chapter 2

Inferences in Regression
and Correlation Analysis


In this chapter, we first take up inferences concerning the regression parameters $\beta_0$ and $\beta_1$,
considering both interval estimation of these parameters and tests about them. We then
discuss interval estimation of the mean $E\{Y\}$ of the probability distribution of Y, for given
X, prediction intervals for a new observation Y, confidence bands for the regression line,
the analysis of variance approach to regression analysis, the general linear test approach,
and descriptive measures of association. Finally, we take up the correlation coefficient, a
measure of association between X and Y when both X and Y are random variables.


Throughout this chapter (excluding Section 2.11), and in the remainder of Part I unless
otherwise stated, we assume that the normal error regression model (1.24) is applicable.
This model is:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{2.1}$$

where:

$\beta_0$ and $\beta_1$ are parameters
$X_i$ are known constants
$\varepsilon_i$ are independent $N(0, \sigma^2)$


2.1 Inferences Concerning $\beta_1$

Frequently, we are interested in drawing inferences about $\beta_1$, the slope of the regression
line in model (2.1). For instance, a market research analyst studying the relation between
sales (Y) and advertising expenditures (X) may wish to obtain an interval estimate of $\beta_1$
because it will provide information as to how many additional sales dollars, on the average,
are generated by an additional dollar of advertising expenditure.

At times, tests concerning $\beta_1$ are of interest, particularly one of the form:

$$H_0\colon \beta_1 = 0$$
$$H_a\colon \beta_1 \neq 0$$







FIGURE 2.1 Regression Model (2.1) when $\beta_1 = 0$.



The reason for interest in testing whether or not $\beta_1 = 0$ is that, when $\beta_1 = 0$, there is no
linear association between Y and X. Figure 2.1 illustrates the case when $\beta_1 = 0$. Note that
the regression line is horizontal and that the means of the probability distributions of Y are
therefore all equal, namely:

$$E\{Y\} = \beta_0 + (0)X = \beta_0$$

For normal error regression model (2.1), the condition $\beta_1 = 0$ implies even more than
no linear association between Y and X. Since for this model all probability distributions of
Y are normal with constant variance, and since the means are equal when $\beta_1 = 0$, it follows
that the probability distributions of Y are identical when $\beta_1 = 0$. This is shown in Figure 2.1.
Thus, $\beta_1 = 0$ for the normal error regression model (2.1) implies not only that there is no
linear association between Y and X but also that there is no relation of any type between
Y and X, since the probability distributions of Y are then identical at all levels of X.

Before discussing inferences concerning $\beta_1$ further, we need to consider the sampling
distribution of $b_1$, the point estimator of $\beta_1$.


Sampling Distribution of $b_1$

The point estimator $b_1$ was given in (1.10a) as follows:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \tag{2.2}$$

The sampling distribution of $b_1$ refers to the different values of $b_1$ that would be obtained
with repeated sampling when the levels of the predictor variable X are held constant from
sample to sample.

For normal error regression model (2.1), the sampling distribution
of $b_1$ is normal, with mean and variance:

$$E\{b_1\} = \beta_1 \tag{2.3a}$$

$$\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} \tag{2.3b}$$


To show this, we need to recognize that $b_1$ is a linear combination of the observations $Y_i$.





$b_1$ as Linear Combination of the $Y_i$. It can be shown that $b_1$, as defined in (2.2), can be
expressed as follows:

$$b_1 = \sum k_i Y_i \tag{2.4}$$

where:

$$k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2} \tag{2.4a}$$

Observe that the $k_i$ are a function of the $X_i$ and therefore are fixed quantities since the $X_i$
are fixed. Hence, $b_1$ is a linear combination of the $Y_i$ where the coefficients are solely a
function of the fixed $X_i$.

The coefficients $k_i$ have a number of interesting properties that will be used later:

$$\sum k_i = 0 \tag{2.5}$$

$$\sum k_i X_i = 1 \tag{2.6}$$

$$\sum k_i^2 = \frac{1}{\sum (X_i - \bar{X})^2} \tag{2.7}$$
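
Properties (2.5) through (2.7) are easy to verify numerically. The sketch below is an illustration only, using the airfreight breakage data of Problem 1.21 as an assumed example data set; it also confirms that $\sum k_i Y_i$ reproduces the slope computed from (2.2).

```python
# Airfreight breakage data from Problem 1.21 (X = transfers, Y = broken ampules).
X = [1, 0, 2, 0, 3, 1, 0, 1, 2, 0]
Y = [16, 9, 17, 12, 22, 13, 8, 15, 19, 11]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)

# Coefficients k_i from (2.4a); they depend only on the fixed X values.
k = [(x - x_bar) / sxx for x in X]

sum_k = sum(k)                                 # property (2.5): should be 0
sum_kx = sum(ki * x for ki, x in zip(k, X))    # property (2.6): should be 1
sum_k2 = sum(ki ** 2 for ki in k)              # property (2.7): should be 1 / sxx

# Slope as a linear combination of the Y_i, (2.4), and directly from (2.2).
b1_from_k = sum(ki * y for ki, y in zip(k, Y))
b1_direct = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx

print(sum_k, sum_kx, sum_k2, 1 / sxx)
print(b1_from_k, b1_direct)
```

Both slope computations agree with the value $b_1 = 4.00$ quoted for these data in Exercise 1.38.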


Comments 

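1. To show that $b_1$ is a linear combination of the $Y_i$ with coefficients $k_i$, we first prove:

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum (X_i - \bar{X})Y_i \tag{2.8}$$

This follows since:

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum (X_i - \bar{X})Y_i - \sum (X_i - \bar{X})\bar{Y}$$

But $\sum (X_i - \bar{X})\bar{Y} = \bar{Y}\sum (X_i - \bar{X}) = 0$ since $\sum (X_i - \bar{X}) = 0$. Hence, (2.8) holds.

We now express $b_1$ using (2.8) and (2.4a):

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum (X_i - \bar{X})Y_i}{\sum (X_i - \bar{X})^2} = \sum k_i Y_i$$

2. The proofs of the properties of the $k_i$ are direct. For example, property (2.5) follows because:

$$\sum k_i = \sum \left[\frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}\right] = \frac{1}{\sum (X_i - \bar{X})^2}\sum (X_i - \bar{X}) = \frac{0}{\sum (X_i - \bar{X})^2} = 0$$

Similarly, property (2.7) follows because:

$$\sum k_i^2 = \sum \left[\frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}\right]^2 = \frac{1}{\left[\sum (X_i - \bar{X})^2\right]^2}\sum (X_i - \bar{X})^2 = \frac{1}{\sum (X_i - \bar{X})^2}$$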

Normality. We return now to the sampling distribution of $b_1$ for the normal error regression
model (2.1). The normality of the sampling distribution of $b_1$ follows at once from the
fact that $b_1$ is a linear combination of the $Y_i$. The $Y_i$ are independently, normally distributed
according to model (2.1), and (A.40) in Appendix A states that a linear combination of
independent normal random variables is normally distributed.

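Mean. The unbiasedness of the point estimator $b_1$, stated earlier in the Gauss-Markov
theorem (1.11), is easy to show:

$$E\{b_1\} = E\left\{\sum k_i Y_i\right\} = \sum k_i E\{Y_i\} = \sum k_i(\beta_0 + \beta_1 X_i) = \beta_0 \sum k_i + \beta_1 \sum k_i X_i$$

By (2.5) and (2.6), we then obtain $E\{b_1\} = \beta_1$.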


Variance. The variance of $b_1$ can be derived readily. We need only remember that the
$Y_i$ are independent random variables, each with variance $\sigma^2$, and that the $k_i$ are constants.
Hence, we obtain by (A.31):

$$\sigma^2\{b_1\} = \sigma^2\left\{\sum k_i Y_i\right\} = \sum k_i^2 \sigma^2\{Y_i\} = \sum k_i^2 \sigma^2 = \sigma^2 \sum k_i^2 = \sigma^2 \frac{1}{\sum (X_i - \bar{X})^2}$$

The last step follows from (2.7).


Estimated Variance. We can estimate the variance of the sampling distribution of $b_1$:

$$\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$

by replacing the parameter $\sigma^2$ with MSE, the unbiased estimator of $\sigma^2$:

$$s^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2} \tag{2.9}$$

The point estimator $s^2\{b_1\}$ is an unbiased estimator of $\sigma^2\{b_1\}$. Taking the positive square
root, we obtain $s\{b_1\}$, the point estimator of $\sigma\{b_1\}$.
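
As a worked illustration of (2.9), the following sketch computes $s^2\{b_1\}$ and $s\{b_1\}$ for the airfreight breakage data of Problem 1.21. The choice of data set and the variable names are assumptions of this example, not part of the original text.

```python
import math

# Airfreight breakage data from Problem 1.21.
X = [1, 0, 2, 0, 3, 1, 0, 1, 2, 0]
Y = [16, 9, 17, 12, 22, 13, 8, 15, 19, 11]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)

# Least squares estimates, (1.10a) and (1.10b).
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx
b0 = y_bar - b1 * x_bar

# MSE = SSE / (n - 2), the unbiased estimator of sigma^2.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
mse = sse / (n - 2)

s2_b1 = mse / sxx          # estimated variance of b1, formula (2.9)
s_b1 = math.sqrt(s2_b1)    # estimated standard deviation of b1

print(f"SSE = {sse:.2f}, MSE = {mse:.2f}")
print(f"s^2{{b1}} = {s2_b1:.3f}, s{{b1}} = {s_b1:.4f}")
```

The error sum of squares matches the value $\sum e_i^2 = 17.60$ quoted for these data in Exercise 1.38.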


Comment 

We stated in theorem (1.11) that bi has minimum variance among all unbiased linear estimators of 
the form: 

- 01 = c i Y i 

where the c t are arbitrary constants. We now prove this. Since is required to be unbiased, the 
following must hold: 

* ， 

E{0i) = =Y^c i E{Y i )^^ 

Now jEfK/} = + Pi^i by (1.2), so the above condition becomes: 

E 0 i) = Q()6o + PiXi) = )6o Ci + CiXi = )61 



44 Part One Simple Linear Regression 


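For the unbiasedness condition to hold, the $c_i$ must follow the restrictions:

$$\sum c_i = 0 \qquad \sum c_i X_i = 1$$

Now the variance of $\hat{\beta}_1$ is, by (A.31):

$$\sigma^2\{\hat{\beta}_1\} = \sum c_i^2 \sigma^2\{Y_i\} = \sigma^2 \sum c_i^2$$

Let us define $c_i = k_i + d_i$, where the $k_i$ are the least squares constants in (2.4a) and the $d_i$ are arbitrary
constants. We can then write:

$$\sigma^2\{\hat{\beta}_1\} = \sigma^2 \sum c_i^2 = \sigma^2 \sum (k_i + d_i)^2 = \sigma^2 \left( \sum k_i^2 + \sum d_i^2 + 2 \sum k_i d_i \right)$$

We know that $\sigma^2 \sum k_i^2 = \sigma^2\{b_1\}$ from our proof above. Further, $\sum k_i d_i = 0$ because of the restrictions
on the $k_i$ and $c_i$ above:

$$\begin{aligned}
\sum k_i d_i &= \sum k_i (c_i - k_i) \\
&= \sum c_i k_i - \sum k_i^2 \\
&= \sum c_i \left[ \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2} \right] - \frac{1}{\sum (X_i - \bar{X})^2} \\
&= \frac{\sum c_i X_i - \bar{X} \sum c_i}{\sum (X_i - \bar{X})^2} - \frac{1}{\sum (X_i - \bar{X})^2} = 0
\end{aligned}$$

Hence, we have:

$$\sigma^2\{\hat{\beta}_1\} = \sigma^2\{b_1\} + \sigma^2 \sum d_i^2$$

Note that the smallest value of $\sum d_i^2$ is zero. Hence, the variance of $\hat{\beta}_1$ is at a minimum when
$\sum d_i^2 = 0$. But this can only occur if all $d_i = 0$, which implies $c_i = k_i$. Thus, the least squares
estimator $b_1$ has minimum variance among all unbiased linear estimators. ■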
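The argument above can be checked numerically. The following is a minimal sketch, not part of the original text: the X levels and the perturbations $d_i$ are hypothetical values chosen only so that the two unbiasedness restrictions hold.

```python
# Hypothetical X levels, chosen only for illustration (not the Toluca data).
X = [20.0, 40.0, 60.0, 80.0, 100.0]
n = len(X)
x_bar = sum(X) / n
sxx = sum((x - x_bar) ** 2 for x in X)

# Least squares constants k_i = (X_i - X_bar) / sum (X_i - X_bar)^2, as in (2.4a).
k = [(x - x_bar) / sxx for x in X]

# Hypothetical perturbations d_i satisfying sum d_i = 0 and sum d_i X_i = 0,
# so that c_i = k_i + d_i still meets both unbiasedness restrictions.
d = [0.01, -0.02, 0.01, 0.0, 0.0]
c = [ki + di for ki, di in zip(k, d)]

sum_c = sum(c)                                   # restriction: should be 0
sum_cx = sum(ci * xi for ci, xi in zip(c, X))    # restriction: should be 1
sum_kd = sum(ki * di for ki, di in zip(k, d))    # cross term: should be 0

# Variances per unit sigma^2: sum c_i^2 versus sum k_i^2.
sum_k_sq = sum(ki ** 2 for ki in k)
sum_d_sq = sum(di ** 2 for di in d)
sum_c_sq = sum(ci ** 2 for ci in c)

print("sum c_i =", sum_c, " sum c_i X_i =", sum_cx)
print("sum k_i^2 =", sum_k_sq, " sum c_i^2 =", sum_c_sq)
```

Under these assumptions, $\sum c_i^2$ exceeds $\sum k_i^2$ by exactly $\sum d_i^2$, which is the decomposition used in the proof.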

Sampling Distribution of $(b_1 - \beta_1)/s\{b_1\}$

Since $b_1$ is normally distributed, we know that the standardized statistic $(b_1 - \beta_1)/\sigma\{b_1\}$
is a standard normal variable. Ordinarily, of course, we need to estimate $\sigma\{b_1\}$ by $s\{b_1\}$,
and hence are interested in the distribution of the statistic $(b_1 - \beta_1)/s\{b_1\}$. When a statistic
is standardized but the denominator is an estimated standard deviation rather than the true
standard deviation, it is called a studentized statistic. An important theorem in statistics
states the following about the studentized statistic $(b_1 - \beta_1)/s\{b_1\}$:

$$\frac{b_1 - \beta_1}{s\{b_1\}} \text{ is distributed as } t(n - 2) \text{ for regression model (2.1)} \tag{2.10}$$

Intuitively, this result should not be unexpected. We know that if the observations $Y_i$
come from the same normal population, $(\bar{Y} - \mu)/s\{\bar{Y}\}$ follows the $t$ distribution with $n - 1$
degrees of freedom. The estimator $b_1$, like $\bar{Y}$, is a linear combination of the observations $Y_i$.
The reason for the difference in the degrees of freedom is that two parameters ($\beta_0$ and $\beta_1$)
need to be estimated for the regression model; hence, two degrees of freedom are lost here.





Comment 

We can show that the studentized statistic $(b_1 - \beta_1)/s\{b_1\}$ is distributed as $t$ with $n - 2$ degrees of
freedom by relying on the following theorem:

For regression model (2.1), $SSE/\sigma^2$ is distributed as $\chi^2$ with $n - 2$
degrees of freedom and is independent of $b_0$ and $b_1$.    (2.11)

First, let us rewrite $(b_1 - \beta_1)/s\{b_1\}$ as follows:

$$\frac{b_1 - \beta_1}{s\{b_1\}} = \frac{b_1 - \beta_1}{\sigma\{b_1\}} \div \frac{s\{b_1\}}{\sigma\{b_1\}}$$

The numerator is a standard normal variable $z$. The nature of the denominator can be seen by first
considering:

$$\frac{s^2\{b_1\}}{\sigma^2\{b_1\}} = \frac{MSE \big/ \sum (X_i - \bar{X})^2}{\sigma^2 \big/ \sum (X_i - \bar{X})^2} = \frac{MSE}{\sigma^2} = \frac{SSE}{\sigma^2 (n - 2)} \sim \frac{\chi^2(n - 2)}{n - 2}$$

where the symbol $\sim$ stands for "is distributed as." The last step follows from (2.11). Hence, we have:

$$\frac{b_1 - \beta_1}{s\{b_1\}} \sim \frac{z}{\sqrt{\dfrac{\chi^2(n - 2)}{n - 2}}}$$

But by theorem (2.11), $z$ and $\chi^2$ are independent since $z$ is a function of $b_1$ and $b_1$ is independent of
$SSE/\sigma^2 \sim \chi^2$. Hence, by (A.44), it follows that:

$$\frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n - 2)$$

This result places us in a position to readily make inferences concerning $\beta_1$. ■


Confidence Interval for $\beta_1$

Since $(b_1 - \beta_1)/s\{b_1\}$ follows a $t$ distribution, we can make the following probability
statement:

$$P\{t(\alpha/2; n - 2) \le (b_1 - \beta_1)/s\{b_1\} \le t(1 - \alpha/2; n - 2)\} = 1 - \alpha \tag{2.12}$$

Here, $t(\alpha/2; n - 2)$ denotes the $(\alpha/2)100$ percentile of the $t$ distribution with $n - 2$ degrees
of freedom. Because of the symmetry of the $t$ distribution around its mean 0, it follows that:

$$t(\alpha/2; n - 2) = -t(1 - \alpha/2; n - 2) \tag{2.13}$$

Rearranging the inequalities in (2.12) and using (2.13), we obtain:

$$P\{b_1 - t(1 - \alpha/2; n - 2)s\{b_1\} \le \beta_1 \le b_1 + t(1 - \alpha/2; n - 2)s\{b_1\}\} = 1 - \alpha \tag{2.14}$$

Since (2.14) holds for all possible values of $\beta_1$, the $1 - \alpha$ confidence limits for $\beta_1$ are:

$$b_1 \pm t(1 - \alpha/2; n - 2)s\{b_1\} \tag{2.15}$$





Example 


TABLE 2.1 Results for Toluca Company Example Obtained in Chapter 1.

FIGURE 2.2 Portion of MINITAB Regression Output: Toluca Company Example.


Consider the Toluca Company example of Chapter 1. Management wishes an estimate of
$\beta_1$ with 95 percent confidence coefficient. We summarize in Table 2.1 the needed results
obtained earlier. First, we need to obtain $s\{b_1\}$:

$$s^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2} = \frac{2{,}384}{19{,}800} = .12040$$

$$s\{b_1\} = .3470$$

This estimated standard deviation is shown in the MINITAB output in Figure 2.2 in the
column labeled Stdev corresponding to the row labeled X. Figure 2.2 repeats the MINITAB
output presented earlier in Chapter 1 and contains some additional results that we will utilize
shortly.

For a 95 percent confidence coefficient, we require $t(.975; 23)$. From Table B.2 in Appendix B,
we find $t(.975; 23) = 2.069$. The 95 percent confidence interval, by (2.15), then is:

$$3.5702 - 2.069(.3470) \le \beta_1 \le 3.5702 + 2.069(.3470)$$

$$2.85 \le \beta_1 \le 4.29$$
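The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch using only the summary figures quoted in the text; the critical value 2.069 is entered by hand from Table B.2 rather than computed, and the variable names are illustrative.

```python
import math

# Summary results quoted in the text for the Toluca Company example.
mse = 2384.0        # MSE
sxx = 19800.0       # sum of (X_i - X_bar)^2
b1 = 3.5702         # estimated slope
t_crit = 2.069      # t(.975; 23), read from Table B.2 (not computed here)

# Estimated variance and standard deviation of b1, formula (2.9).
s2_b1 = mse / sxx
s_b1 = math.sqrt(s2_b1)

# 95 percent confidence limits for beta_1, formula (2.15).
lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1

print(f"s^2{{b1}} = {s2_b1:.5f}, s{{b1}} = {s_b1:.4f}")
print(f"{lower:.2f} <= beta_1 <= {upper:.2f}")
```

With a statistics library available, the critical value could be computed from the $t$ distribution with 23 degrees of freedom instead of being copied from the table.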


Thus, with confidence coefficient .95, we estimate that the mean number of work hours 
increases by somewhere between 2.85 and 4.29 hours for each additional unit in the lot. 


Comment 

In Chapter 1, we noted that the scope of a regression model is restricted ordinarily to some range of
values of the predictor variable. This is particularly important to keep in mind in using estimates of
the slope $\beta_1$. In our Toluca Company example, a linear regression model appeared appropriate for
lot sizes between 20 and 120, the range of the predictor variable in the recent past. It may not be


$n = 25$                                  $\bar{X} = 70.00$
$b_0 = 62.37$                             $b_1 = 3.5702$
$\hat{Y} = 62.37 + 3.5702X$               $SSE = 54{,}825$
$\sum (X_i - \bar{X})^2 = 19{,}800$       $MSE = 2{,}384$
$\sum (Y_i - \bar{Y})^2 = 307{,}203$




The regression equation is
Y = 62.4 + 3.57 X

Predictor       Coef     Stdev    t-ratio        p
Constant       62.37     26.18       2.38    0.026
X             3.5702    0.3470      10.29    0.000

s = 48.82       R-sq = 82.2%       R-sq(adj) = 81.4%

Analysis of Variance

SOURCE        DF        SS        MS         F        p
Regression     1    252378    252378    105.88    0.000
Error         23     54825      2384
Total         24    307203







reasonable to use the estimate of the slope to infer the effect of lot size on number of work hours far 
outside this range since the regression relation may not be linear there. ■ 


Tests Concerning $\beta_1$

Since $(b_1 - \beta_1)/s\{b_1\}$ is distributed as $t$ with $n - 2$ degrees of freedom, tests concerning
$\beta_1$ can be set up in ordinary fashion using the $t$ distribution.


Example 1 


Two-Sided Test A cost analyst in the Toluca Company is interested in testing, using
regression model (2.1), whether or not there is a linear association between work hours and
lot size, i.e., whether or not $\beta_1 = 0$. The two alternatives then are:

$$\begin{aligned} H_0&\colon \beta_1 = 0 \\ H_a&\colon \beta_1 \neq 0 \end{aligned} \tag{2.16}$$

The analyst wishes to control the risk of a Type I error at $\alpha = .05$. The conclusion $H_a$ could
be reached at once by referring to the 95 percent confidence interval for $\beta_1$ constructed
earlier, since this interval does not include 0.

An explicit test of the alternatives (2.16) is based on the test statistic:

$$t^* = \frac{b_1}{s\{b_1\}} \tag{2.17}$$

The decision rule with this test statistic for controlling the level of significance at $\alpha$ is:

$$\begin{aligned} &\text{If } |t^*| \le t(1 - \alpha/2; n - 2), \text{ conclude } H_0 \\ &\text{If } |t^*| > t(1 - \alpha/2; n - 2), \text{ conclude } H_a \end{aligned} \tag{2.18}$$

For the Toluca Company example, where $\alpha = .05$, $b_1 = 3.5702$, and $s\{b_1\} = .3470$, we
require $t(.975; 23) = 2.069$. Thus, the decision rule for testing alternatives (2.16) is:

If $|t^*| \le 2.069$, conclude $H_0$
If $|t^*| > 2.069$, conclude $H_a$

Since $|t^*| = |3.5702/.3470| = 10.29 > 2.069$, we conclude $H_a$, that $\beta_1 \neq 0$ or that
there is a linear association between work hours and lot size. The value of the test statistic,
$t^* = 10.29$, is shown in the MINITAB output in Figure 2.2 in the column labeled t-ratio
and the row labeled X.
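As a rough illustration, decision rule (2.18) can be written as a short comparison. The sketch below uses the values quoted in the text; the critical value is taken from Table B.2 rather than computed, and the variable names are illustrative.

```python
# Values quoted in the text for the Toluca Company example.
b1 = 3.5702       # estimated slope
s_b1 = 0.3470     # estimated standard deviation of b1
t_crit = 2.069    # t(.975; 23) from Table B.2, for alpha = .05

# Test statistic (2.17) for H0: beta_1 = 0 versus Ha: beta_1 != 0.
t_star = b1 / s_b1

# Decision rule (2.18): conclude Ha only if |t*| exceeds the critical value.
conclusion = "Ha" if abs(t_star) > t_crit else "H0"

print(f"t* = {t_star:.2f}, conclude {conclusion}")
```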

The two-sided P-value for the sample outcome is obtained by first finding the one-sided
P-value, $P\{t(23) > t^* = 10.29\}$. We see from Table B.2 that this probability is
less than .0005. Many statistical calculators and computer packages will provide the actual
probability; it is almost 0, denoted by 0+. Thus, the two-sided P-value is 2(0+) = 0+.
Since the two-sided P-value is less than the specified level of significance $\alpha = .05$, we
could conclude $H_a$ directly. The MINITAB output in Figure 2.2 shows the P-value in the
column labeled p, corresponding to the row labeled X. It is shown as 0.000.

Comment 


When the test of whether or not $\beta_1 = 0$ leads to the conclusion that $\beta_1 \neq 0$, the association between
Y and X is sometimes described to be a linear statistical association. ■


Example 2 


One-Sided Test Suppose the analyst had wished to test whether or not $\beta_1$ is positive,
controlling the level of significance at $\alpha = .05$. The alternatives then would be:

$$\begin{aligned} H_0&\colon \beta_1 \le 0 \\ H_a&\colon \beta_1 > 0 \end{aligned}$$





and the decision rule based on test statistic (2.17) would be:

If $t^* \le t(1 - \alpha; n - 2)$, conclude $H_0$
If $t^* > t(1 - \alpha; n - 2)$, conclude $H_a$

For $\alpha = .05$, we require $t(.95; 23) = 1.714$. Since $t^* = 10.29 > 1.714$, we would conclude
$H_a$, that $\beta_1$ is positive.

This same conclusion could be reached directly from the one-sided P-value, which was 
noted in Example 1 to be 0+. Since this P-value is less than .05, we would conclude H a . 


Comments 

1. The P-value is sometimes called the observed level of significance.

2. Many scientific publications commonly report the P-value together with the value of the test
statistic. In this way, one can conduct a test at any desired level of significance $\alpha$ by comparing the
P-value with the specified level $\alpha$.

3. Users of statistical calculators and computer packages need to be careful to ascertain whether 
one-sided or two-sided P-values are reported. Many commonly used labels, such as PROB or P, do 
not reveal whether the P-value is one- or two-sided. 

4. Occasionally, it is desired to test whether or not $\beta_1$ equals some specified nonzero value $\beta_{10}$,
which may be a historical norm, the value for a comparable process, or an engineering specification.
The alternatives now are:

$$\begin{aligned} H_0&\colon \beta_1 = \beta_{10} \\ H_a&\colon \beta_1 \neq \beta_{10} \end{aligned} \tag{2.19}$$

and the appropriate test statistic is:

$$t^* = \frac{b_1 - \beta_{10}}{s\{b_1\}} \tag{2.20}$$

The decision rule to be employed here still is (2.18), but it is now based on $t^*$ defined in (2.20).

Note that test statistic (2.20) simplifies to test statistic (2.17) when the test involves $H_0\colon \beta_1 = \beta_{10} = 0$. ■
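To make the relation between (2.20) and (2.17) concrete, here is a small sketch. The function name `t_statistic` and the value $\beta_{10} = 3.0$ are hypothetical, introduced only for illustration; the estimates and critical value are those quoted for the Toluca Company example.

```python
def t_statistic(b1, s_b1, beta10=0.0):
    """Test statistic (2.20); with beta10 = 0 it reduces to (2.17)."""
    return (b1 - beta10) / s_b1

# Toluca Company estimates quoted in the text.
b1, s_b1 = 3.5702, 0.3470
t_crit = 2.069          # t(.975; 23) from Table B.2

# Hypothetical specified value beta_10 (not from the text).
beta10 = 3.0
t_star = t_statistic(b1, s_b1, beta10)

# Decision rule (2.18) applied to t* from (2.20).
conclusion = "Ha" if abs(t_star) > t_crit else "H0"
print(f"t* = {t_star:.3f}, conclude {conclusion}")
```

Under this assumed $\beta_{10}$, the statistic falls below the critical value, so the rule would lead to $H_0$; with `beta10=0.0` the function returns the same $t^*$ as in Example 1.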


2.2 Inferences Concerning $\beta_0$


As noted in Chapter 1, there are only infrequent occasions when we wish to make inferences
concerning $\beta_0$, the intercept of the regression line. These occur when the scope of the model
includes X = 0.

Sampling Distribution of $b_0$

The point estimator $b_0$ was given in (1.10b) as follows:

$$b_0 = \bar{Y} - b_1 \bar{X} \tag{2.21}$$

The sampling distribution of $b_0$ refers to the different values of $b_0$ that would be obtained
with repeated sampling when the levels of the predictor variable X are held constant from





sample to sample. 

For regression model (2.1), the sampling distribution of $b_0$
is normal, with mean and variance:    (2.22)

$$E\{b_0\} = \beta_0 \tag{2.22a}$$

$$\sigma^2\{b_0\} = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right] \tag{2.22b}$$

The normality of the sampling distribution of $b_0$ follows because $b_0$, like $b_1$, is a linear
combination of the observations $Y_i$. The results for the mean and variance of the sampling
distribution of $b_0$ can be obtained in similar fashion as those for $b_1$.

An estimator of $\sigma^2\{b_0\}$ is obtained by replacing $\sigma^2$ by its point estimator MSE:

$$s^2\{b_0\} = MSE \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right] \tag{2.23}$$

The positive square root, $s\{b_0\}$, is an estimator of $\sigma\{b_0\}$.


Sampling Distribution of $(b_0 - \beta_0)/s\{b_0\}$

Analogous to theorem (2.10) for $b_1$, a theorem for $b_0$ states:

$$\frac{b_0 - \beta_0}{s\{b_0\}} \text{ is distributed as } t(n - 2) \text{ for regression model (2.1)} \tag{2.24}$$

Hence, confidence intervals for $\beta_0$ and tests concerning $\beta_0$ can be set up in ordinary fashion,
using the $t$ distribution.


Confidence Interval for $\beta_0$

The $1 - \alpha$ confidence limits for $\beta_0$ are obtained in the same manner as those for $\beta_1$ derived
earlier. They are:

$$b_0 \pm t(1 - \alpha/2; n - 2)s\{b_0\} \tag{2.25}$$


Example 


As noted earlier, the scope of the model for the Toluca Company example does not extend to
lot sizes of X = 0. Hence, the regression parameter $\beta_0$ may not have intrinsic meaning here.
If, nevertheless, a 90 percent confidence interval for $\beta_0$ were desired, we would proceed by
finding $t(.95; 23)$ and $s\{b_0\}$. From Table B.2, we find $t(.95; 23) = 1.714$. Using the earlier
results summarized in Table 2.1, we obtain by (2.23):

$$s^2\{b_0\} = MSE \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right] = 2{,}384 \left[ \frac{1}{25} + \frac{(70.00)^2}{19{,}800} \right] = 685.34$$

or:

$$s\{b_0\} = 26.18$$

The MINITAB output in Figure 2.2 shows this estimated standard deviation in the column
labeled Stdev and the row labeled Constant.

The 90 percent confidence interval for $\beta_0$ is:

$$62.37 - 1.714(26.18) \le \beta_0 \le 62.37 + 1.714(26.18)$$

$$17.5 \le \beta_0 \le 107.2$$
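The computation of $s\{b_0\}$ and the interval can be reproduced as follows. This is a minimal sketch built from the summary values in the text; the critical value 1.714 is copied from Table B.2, and the variable names are illustrative.

```python
import math

# Summary results quoted in the text for the Toluca Company example.
mse = 2384.0      # MSE
n = 25            # number of observations
x_bar = 70.00     # mean lot size
sxx = 19800.0     # sum of (X_i - X_bar)^2
b0 = 62.37        # estimated intercept
t_crit = 1.714    # t(.95; 23) from Table B.2, for a 90 percent interval

# Estimated variance and standard deviation of b0, formula (2.23).
s2_b0 = mse * (1.0 / n + x_bar ** 2 / sxx)
s_b0 = math.sqrt(s2_b0)

# 90 percent confidence limits for beta_0, formula (2.25).
lower = b0 - t_crit * s_b0
upper = b0 + t_crit * s_b0

print(f"s^2{{b0}} = {s2_b0:.2f}, s{{b0}} = {s_b0:.2f}")
print(f"{lower:.1f} <= beta_0 <= {upper:.1f}")
```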




We caution again that this confidence interval does not necessarily provide meaningful 
information. For instance, it does not necessarily provide information about the “setup” 
cost (the cost incurred in setting up the production process for the part) since we are not 
certain whether a linear regression model is appropriate when the scope of the model is 
extended to X = 0. 


2.3 Some Considerations on Making Inferences Concerning $\beta_0$ and $\beta_1$


Effects of Departures from Normality

If the probability distributions of Y are not exactly normal but do not depart seriously,
the sampling distributions of $b_0$ and $b_1$ will be approximately normal, and the use of the
$t$ distribution will provide approximately the specified confidence coefficient or level of
significance. Even if the distributions of Y are far from normal, the estimators $b_0$ and $b_1$
generally have the property of asymptotic normality: their distributions approach normality
under very general conditions as the sample size increases. Thus, with sufficiently large
samples, the confidence intervals and decision rules given earlier still apply even if the
probability distributions of Y depart far from normality. For large samples, the $t$ value is,
of course, replaced by the $z$ value for the standard normal distribution.

Interpretation of Confidence Coefficient and Risks of Errors 

Since regression model (2.1) assumes that the $X_i$ are known constants, the confidence
coefficient and risks of errors are interpreted with respect to taking repeated samples in
which the X observations are kept at the same levels as in the observed sample. For instance,
we constructed a confidence interval for $\beta_1$ with confidence coefficient .95 in the Toluca
Company example. This coefficient is interpreted to mean that if many independent samples
are taken where the levels of X (the lot sizes) are the same as in the data set and a 95 percent
confidence interval is constructed for each sample, 95 percent of the intervals will contain
the true value of $\beta_1$.

Spacing of the X Levels 

Inspection of formulas (2.3b) and (2.22b) for the variances of $b_1$ and $b_0$, respectively,
indicates that for given $n$ and $\sigma^2$ these variances are affected by the spacing of the X
levels in the observed data. For example, the greater is the spread in the X levels, the larger
is the quantity $\sum (X_i - \bar{X})^2$ and the smaller is the variance of $b_1$. We discuss in Chapter 4
how the X observations should be spaced in experiments where spacing can be controlled.
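The effect of spacing can be seen with a small numerical comparison. The sketch below is illustrative only: both sets of X levels and the value of $\sigma^2$ are hypothetical, and the helper name `var_b1` is not from the text. It evaluates $\sigma^2\{b_1\} = \sigma^2 / \sum (X_i - \bar{X})^2$ for a narrow and a wide design with the same $n$ and the same mean.

```python
def var_b1(x_levels, sigma2):
    """Variance of b1 for given X levels: sigma^2 / sum (X_i - X_bar)^2."""
    x_bar = sum(x_levels) / len(x_levels)
    sxx = sum((x - x_bar) ** 2 for x in x_levels)
    return sigma2 / sxx

sigma2 = 2500.0                       # hypothetical error variance
narrow = [60, 65, 70, 75, 80]         # hypothetical, tightly spaced X levels
wide = [20, 45, 70, 95, 120]          # hypothetical, widely spaced X levels

print("narrow spacing:", var_b1(narrow, sigma2))
print("wide spacing:  ", var_b1(wide, sigma2))
```

Both designs have five observations centered at 70, so any difference in the variance of $b_1$ comes from the spread of the X levels alone.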

Power of Tests 

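The power of tests on $\beta_0$ and $\beta_1$ can be obtained from Appendix Table B.5. Consider, for
example, the general test concerning $\beta_1$ in (2.19):

$$\begin{aligned} H_0&\colon \beta_1 = \beta_{10} \\ H_a&\colon \beta_1 \neq \beta_{10} \end{aligned}$$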


for which test statistic (2.20) is employed:

$$t^* = \frac{b_1 - \beta_{10}}{s\{b_1\}}$$

and the decision rule for level of significance $\alpha$ is given in (2.18):

If $|t^*| \le t(1 - \alpha/2; n - 2)$, conclude $H_0$
If $|t^*| > t(1 - \alpha/2; n - 2)$, conclude $H_a$

The power of this test is the probability that the decision rule will lead to conclusion $H_a$
when $H_a$ in fact holds. Specifically, the power is given by:

$$\text{Power} = P\{|t^*| > t(1 - \alpha/2; n - 2) \mid \delta\} \tag{2.26}$$

where $\delta$ is the noncentrality measure, i.e., a measure of how far the true value of $\beta_1$ is from
$\beta_{10}$:

$$\delta = \frac{|\beta_1 - \beta_{10}|}{\sigma\{b_1\}} \tag{2.27}$$


Table B.5 presents the power of the two-sided $t$ test for $\alpha = .05$ and $\alpha = .01$, for various
degrees of freedom $df$. To illustrate the use of this table, let us return to the Toluca Company
example where we tested:

$$\begin{aligned} H_0&\colon \beta_1 = \beta_{10} = 0 \\ H_a&\colon \beta_1 \neq \beta_{10} = 0 \end{aligned}$$

Suppose we wish to know the power of the test when $\beta_1 = 1.5$. To ascertain this, we need
to know $\sigma^2$, the variance of the error terms. Assume, based on prior information or pilot
data, that a reasonable planning value for the unknown variance is $\sigma^2 = 2{,}500$, so $\sigma^2\{b_1\}$
for our example would be:

$$\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} = \frac{2{,}500}{19{,}800} = .1263$$

or $\sigma\{b_1\} = .3553$. Then $\delta = |1.5 - 0| \div .3553 = 4.22$. We enter Table B.5 for $\alpha = .05$ (the
level of significance used in the test) and 23 degrees of freedom and interpolate linearly
between $\delta = 4.00$ and $\delta = 5.00$. We obtain:

$$.97 + \frac{4.22 - 4.00}{5.00 - 4.00}(1.00 - .97) = .9766$$

Thus, if $\beta_1 = 1.5$, the probability would be about .98 that we would be led to conclude
$H_a$ ($\beta_1 \neq 0$). In other words, if $\beta_1 = 1.5$, we would be almost certain to conclude that there
is a linear relation between work hours and lot size.
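The noncentrality measure and the table interpolation can be sketched in code. This is only an illustration: the two power values .97 and 1.00 are the Table B.5 entries quoted in the text for $\delta = 4.00$ and $\delta = 5.00$, and the power is interpolated linearly between them rather than computed from a noncentral $t$ distribution.

```python
import math

# Planning values from the text.
sigma2 = 2500.0      # planning value for the error variance
sxx = 19800.0        # sum of (X_i - X_bar)^2
beta1 = 1.5          # true slope for which the power is wanted
beta10 = 0.0         # value of beta_1 under H0

# Noncentrality measure, formula (2.27).
sigma_b1 = math.sqrt(sigma2 / sxx)
delta = abs(beta1 - beta10) / sigma_b1

# Table B.5 entries quoted in the text (alpha = .05, 23 degrees of freedom).
delta_lo, power_lo = 4.00, 0.97
delta_hi, power_hi = 5.00, 1.00

# Linear interpolation between the two tabled values.
power = power_lo + (delta - delta_lo) / (delta_hi - delta_lo) * (power_hi - power_lo)

print(f"sigma{{b1}} = {sigma_b1:.4f}, delta = {delta:.2f}, power = {power:.4f}")
```

Because the code carries $\delta$ at full precision instead of rounding it to 4.22 first, the interpolated power may differ from .9766 in the fourth decimal place.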

The power of tests concerning $\beta_0$ can be obtained from Table B.5 in completely analogous
fashion. For one-sided tests, Table B.5 should be entered so that one-half the level of
significance shown there is the level of significance of the one-sided test.





2.4 Interval Estimation of E{Yh} 


A common objective in regression analysis is to estimate the mean for one or more probability
distributions of Y. Consider, for example, a study of the relation between level of
piecework pay (X) and worker productivity (Y). The mean productivity at high and medium
levels of piecework pay may be of particular interest for purposes of analyzing the benefits
obtained from an increase in the pay. As another example, the Toluca Company was
interested in the mean response (mean number of work hours) for a range of lot sizes for
purposes of finding the optimum lot size.

Let $X_h$ denote the level of X for which we wish to estimate the mean response. $X_h$ may
be a value which occurred in the sample, or it may be some other value of the predictor
variable within the scope of the model. The mean response when $X = X_h$ is denoted by
$E\{Y_h\}$. Formula (1.12) gives us the point estimator $\hat{Y}_h$ of $E\{Y_h\}$:

$$\hat{Y}_h = b_0 + b_1 X_h \tag{2.28}$$

We consider now the sampling distribution of $\hat{Y}_h$.


Sampling Distribution of $\hat{Y}_h$

The sampling distribution of $\hat{Y}_h$, like the earlier sampling distributions discussed, refers to
the different values of $\hat{Y}_h$ that would be obtained if repeated samples were selected, each
holding the levels of the predictor variable X constant, and calculating $\hat{Y}_h$ for each sample.

For normal error regression model (2.1), the sampling distribution of
$\hat{Y}_h$ is normal, with mean and variance:    (2.29)

$$E\{\hat{Y}_h\} = E\{Y_h\} \tag{2.29a}$$

$$\sigma^2\{\hat{Y}_h\} = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \tag{2.29b}$$


Normality. The normality of the sampling distribution of $\hat{Y}_h$ follows directly from the
fact that $\hat{Y}_h$, like $b_0$ and $b_1$, is a linear combination of the observations $Y_i$.

Mean. Note from (2.29a) that $\hat{Y}_h$ is an unbiased estimator of $E\{Y_h\}$. To prove this, we
proceed as follows:

$$E\{\hat{Y}_h\} = E\{b_0 + b_1 X_h\} = E\{b_0\} + X_h E\{b_1\} = \beta_0 + \beta_1 X_h$$

by (2.3a) and (2.22a).

Variance. Note from (2.29b) that the variability of the sampling distribution of $\hat{Y}_h$ is
affected by how far $X_h$ is from $\bar{X}$, through the term $(X_h - \bar{X})^2$. The further from $\bar{X}$ is
$X_h$, the greater is the quantity $(X_h - \bar{X})^2$ and the larger is the variance of $\hat{Y}_h$. An intuitive
explanation of this effect is found in Figure 2.3. Shown there are two sample regression
lines, based on two samples for the same set of X values. The two regression lines are
assumed to go through the same $(\bar{X}, \bar{Y})$ point to isolate the effect of interest, namely, the
effect of variation in the estimated slope $b_1$ from sample to sample. Note that at $X_1$, near
$\bar{X}$, the fitted values $\hat{Y}_1$ for the two sample regression lines are close to each other. At $X_2$,
which is far from $\bar{X}$, the situation is different. Here, the fitted values $\hat{Y}_2$ differ substantially.






FIGURE 2.3 Effect on $\hat{Y}_h$ of Variation in $b_1$ from Sample to Sample in Two Samples with Same Means $\bar{Y}$ and $\bar{X}$.


Thus, variation in the slope $b_1$ from sample to sample has a much more pronounced effect
on $\hat{Y}_h$ for X levels far from the mean $\bar{X}$ than for X levels near $\bar{X}$. Hence, the variation in the
$\hat{Y}_h$ values from sample to sample will be greater when $X_h$ is far from the mean than when
$X_h$ is near the mean.

When MSE is substituted for $\sigma^2$ in (2.29b), we obtain $s^2\{\hat{Y}_h\}$, the estimated variance
of $\hat{Y}_h$:

$$s^2\{\hat{Y}_h\} = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \tag{2.30}$$

The estimated standard deviation of $\hat{Y}_h$ is then $s\{\hat{Y}_h\}$, the positive square root of $s^2\{\hat{Y}_h\}$.


Comments 

1. When $X_h = 0$, the variance of $\hat{Y}_h$ in (2.29b) reduces to the variance of $b_0$ in (2.22b). Similarly,
$s^2\{\hat{Y}_h\}$ in (2.30) reduces to $s^2\{b_0\}$ in (2.23). The reason is that $\hat{Y}_h = b_0$ when $X_h = 0$ since $\hat{Y}_h =
b_0 + b_1 X_h$.

2. To derive $\sigma^2\{\hat{Y}_h\}$, we first show that $b_1$ and $\bar{Y}$ are uncorrelated and, hence, for regression model
(2.1), independent:

$$\sigma\{\bar{Y}, b_1\} = 0 \tag{2.31}$$

where $\sigma\{\bar{Y}, b_1\}$ denotes the covariance between $\bar{Y}$ and $b_1$. We begin with the definitions:

$$\bar{Y} = \sum \left( \frac{1}{n} \right) Y_i \qquad b_1 = \sum k_i Y_i$$

where $k_i$ is as defined in (2.4a). We now use (A.32), with $a_i = 1/n$ and $c_i = k_i$; remember that the
$Y_i$ are independent random variables:

$$\sigma\{\bar{Y}, b_1\} = \sum \left( \frac{1}{n} \right) k_i \sigma^2\{Y_i\} = \frac{\sigma^2}{n} \sum k_i$$

But we know from (2.5) that $\sum k_i = 0$. Hence, the covariance is 0.

Now we are ready to find the variance of $\hat{Y}_h$. We shall use the estimator in the alternative form (1.15):

$$\sigma^2\{\hat{Y}_h\} = \sigma^2\{\bar{Y} + b_1(X_h - \bar{X})\}$$





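Since $\bar{Y}$ and $b_1$ are independent and $X_h$ and $\bar{X}$ are constants, we obtain:

$$\sigma^2\{\hat{Y}_h\} = \sigma^2\{\bar{Y}\} + (X_h - \bar{X})^2 \sigma^2\{b_1\}$$

Now $\sigma^2\{b_1\}$ is given in (2.3b), and:

$$\sigma^2\{\bar{Y}\} = \frac{\sigma^2\{Y_i\}}{n} = \frac{\sigma^2}{n}$$

Hence:

$$\sigma^2\{\hat{Y}_h\} = \frac{\sigma^2}{n} + (X_h - \bar{X})^2 \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$

which, upon a slight rearrangement of terms, yields (2.29b).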




Sampling Distribution of $(\hat{Y}_h - E\{Y_h\})/s\{\hat{Y}_h\}$

Since we have encountered the $t$ distribution in each type of inference for regression
model (2.1) up to this point, it should not be surprising that:

$$\frac{\hat{Y}_h - E\{Y_h\}}{s\{\hat{Y}_h\}} \text{ is distributed as } t(n - 2) \text{ for regression model (2.1)} \tag{2.32}$$

Hence, all inferences concerning $E\{Y_h\}$ are carried out in the usual fashion with the $t$
distribution. We illustrate the construction of confidence intervals, since in practice these
are used more frequently than tests.


Confidence Interval for $E\{Y_h\}$

A confidence interval for $E\{Y_h\}$ is constructed in the standard fashion, making use of the $t$
distribution as indicated by theorem (2.32). The $1 - \alpha$ confidence limits are:

$$\hat{Y}_h \pm t(1 - \alpha/2; n - 2)s\{\hat{Y}_h\} \tag{2.33}$$


Example 1 


Returning to the Toluca Company example, let us find a 90 percent confidence interval for
$E\{Y_h\}$ when the lot size is $X_h = 65$ units. Using the earlier results in Table 2.1, we find the
point estimate $\hat{Y}_h$:

$$\hat{Y}_h = 62.37 + 3.5702(65) = 294.4$$

Next, we need to find the estimated standard deviation $s\{\hat{Y}_h\}$. We obtain, using (2.30):

$$s^2\{\hat{Y}_h\} = 2{,}384 \left[ \frac{1}{25} + \frac{(65 - 70.00)^2}{19{,}800} \right] = 98.37$$

$$s\{\hat{Y}_h\} = 9.918$$

For a 90 percent confidence coefficient, we require $t(.95; 23) = 1.714$. Hence, our confidence
interval with confidence coefficient .90 is by (2.33):

$$294.4 - 1.714(9.918) \le E\{Y_h\} \le 294.4 + 1.714(9.918)$$

$$277.4 \le E\{Y_h\} \le 311.4$$

We conclude with confidence coefficient .90 that the mean number of work hours required 
when lots of 65 units are produced is somewhere between 277.4 and 311.4 hours. We see 
that our estimate of the mean number of work hours is moderately precise. 





Example 2 


Suppose the Toluca Company wishes to estimate $E\{Y_h\}$ for lots with $X_h = 100$ units with
a 90 percent confidence interval. We require:

$$\hat{Y}_h = 62.37 + 3.5702(100) = 419.4$$

$$s^2\{\hat{Y}_h\} = 2{,}384 \left[ \frac{1}{25} + \frac{(100 - 70.00)^2}{19{,}800} \right] = 203.72$$

$$s\{\hat{Y}_h\} = 14.27$$

$$t(.95; 23) = 1.714$$

Hence, the 90 percent confidence interval is:

$$419.4 - 1.714(14.27) \le E\{Y_h\} \le 419.4 + 1.714(14.27)$$

$$394.9 \le E\{Y_h\} \le 443.9$$

Note that this confidence interval is somewhat wider than that for Example 1, since the
$X_h$ level here ($X_h = 100$) is substantially farther from the mean $\bar{X} = 70.0$ than the
$X_h$ level for Example 1 ($X_h = 65$).
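Both examples follow the same steps, so they can be wrapped in one small function. The sketch below is illustrative: the function name `mean_response_ci` and its argument names are not from the text, and the critical value is again read from Table B.2 rather than computed.

```python
import math

def mean_response_ci(x_h, b0, b1, mse, n, x_bar, sxx, t_crit):
    """Point estimate (2.28), s{Y_hat_h} from (2.30), and limits (2.33)."""
    y_hat = b0 + b1 * x_h
    s2 = mse * (1.0 / n + (x_h - x_bar) ** 2 / sxx)
    s = math.sqrt(s2)
    return y_hat, s, y_hat - t_crit * s, y_hat + t_crit * s

# Toluca Company summary results from Table 2.1; t(.95; 23) from Table B.2.
toluca = dict(b0=62.37, b1=3.5702, mse=2384.0, n=25,
              x_bar=70.00, sxx=19800.0, t_crit=1.714)

ci_65 = mean_response_ci(65, **toluca)     # Example 1
ci_100 = mean_response_ci(100, **toluca)   # Example 2

for x_h, (y_hat, s, lo, hi) in [(65, ci_65), (100, ci_100)]:
    print(f"X_h = {x_h}: Y_hat = {y_hat:.1f}, s = {s:.3f}, "
          f"{lo:.1f} <= E{{Y_h}} <= {hi:.1f}")
```

Comparing the two results shows the widening of the interval as $X_h$ moves away from $\bar{X}$.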

Comments 

1. Since the $X_i$ are known constants in regression model (2.1), the interpretation of confidence
intervals and risks of errors in inferences on the mean response is in terms of taking repeated
samples in which the X observations are at the same levels as in the actual study. We noted this
same point in connection with inferences on $\beta_0$ and $\beta_1$.

2. We see from formula (2.29b) that, for given sample results, the variance of $\hat{Y}_h$ is smallest when
$X_h = \bar{X}$. Thus, in an experiment to estimate the mean response at a particular level $X_h$ of the
predictor variable, the precision of the estimate will be greatest if (everything else remaining equal)
the observations on X are spaced so that $\bar{X} = X_h$.

3. The usual relationship between confidence intervals and tests applies in inferences concerning the
mean response. Thus, the two-sided confidence limits (2.33) can be utilized for two-sided tests
concerning the mean response at $X_h$. Alternatively, a regular decision rule can be set up.

4. The confidence limits (2.33) for a mean response $E\{Y_h\}$ are not sensitive to moderate departures
from the assumption that the error terms are normally distributed. Indeed, the limits are not sensitive
to substantial departures from normality if the sample size is large. This robustness in estimating
the mean response is related to the robustness of the confidence limits for $\beta_0$ and $\beta_1$ noted earlier.

5. Confidence limits (2.33) apply when a single mean response is to be estimated from the study. We
discuss in Chapter 4 how to proceed when several mean responses are to be estimated from the
same data. ■



2.5 Prediction of New Observation 


We consider now the prediction of a new observation Y corresponding to a given level X of
the predictor variable. Three illustrations where prediction of a new observation is needed
follow.

1. In the Toluca Company example, the next lot to be produced consists of 100 units and 
management wishes to predict the number of work hours for this particular lot. 






2. An economist has estimated the regression relation between company sales and number
of persons 16 or more years old from data for the past 10 years. Using a reliable demographic
projection of the number of persons 16 or more years old for next year, the
economist wishes to predict next year's company sales.

3. An admissions officer at a university has estimated the regression relation between
the high school grade point average (GPA) of admitted students and the first-year college
GPA. The officer wishes to predict the first-year college GPA for an applicant whose
high school GPA is 3.5 as part of the information on which an admissions decision will
be based.

The new observation on Y to be predicted is viewed as the result of a new trial, independent
of the trials on which the regression analysis is based. We denote the level of X
for the new trial as $X_h$ and the new observation on Y as $Y_{h(new)}$. Of course, we assume
that the underlying regression model applicable for the basic sample data continues to be
appropriate for the new observation.

The distinction between estimation of the mean response $E\{Y_h\}$, discussed in the preceding
section, and prediction of a new response $Y_{h(new)}$, discussed now, is basic. In the
former case, we estimate the mean of the distribution of Y. In the present case, we predict
an individual outcome drawn from the distribution of Y. Of course, the great majority of
individual outcomes deviate from the mean response, and this must be taken into account
by the procedure for predicting $Y_{h(new)}$.

Prediction Interval for $Y_{h(new)}$ when Parameters Known

To illustrate the nature of a prediction interval for a new observation $Y_{h(new)}$ in as simple a
fashion as possible, we shall first assume that all regression parameters are known. Later
we drop this assumption and make appropriate modifications.

Suppose that in the college admissions example the relevant parameters of the regression
model are known to be:

$$\beta_0 = .10 \qquad \beta_1 = .95$$

$$E\{Y\} = .10 + .95X$$

$$\sigma = .12$$

The admissions officer is considering an applicant whose high school GPA is $X_h = 3.5$.
The mean college GPA for students whose high school average is 3.5 is:

$$E\{Y_h\} = .10 + .95(3.5) = 3.425$$

Figure 2.4 shows the probability distribution of Y for $X_h = 3.5$. Its mean is $E\{Y_h\} = 3.425$,
and its standard deviation is $\sigma = .12$. Further, the distribution is normal in accord with
regression model (2.1).

Suppose we were to predict that the college GPA of the applicant whose high school
GPA is $X_h = 3.5$ will be between:

$$E\{Y_h\} \pm 3\sigma$$

$$3.425 \pm 3(.12)$$

so that the prediction interval would be:

$$3.065 \le Y_{h(new)} \le 3.785$$






Since 99.7 percent of the area in a normal probability distribution falls within three standard 
deviations from the mean, the probability is .997 that this prediction interval will give a 
correct prediction for the applicant with high school GPA of 3.5. While the prediction limits 
here are rather wide, so that the prediction is not too precise, the prediction interval does 
indicate to the admissions officer that the applicant is expected to attain at least a 3.0 GPA 
in the first year of college. 
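The three-standard-deviation interval and its coverage probability can be checked with the standard library. This is a minimal sketch using the parameter values given for the college admissions example; the variable names are illustrative.

```python
from statistics import NormalDist

# Known parameters assumed in the college admissions example.
beta0, beta1, sigma = 0.10, 0.95, 0.12
x_h = 3.5                         # applicant's high school GPA

# Mean response and the three-standard-deviation prediction limits.
mean_h = beta0 + beta1 * x_h
lower = mean_h - 3 * sigma
upper = mean_h + 3 * sigma

# Probability that a normal variable falls within three standard deviations.
z = NormalDist()                  # standard normal distribution
coverage = z.cdf(3) - z.cdf(-3)

print(f"E{{Y_h}} = {mean_h:.3f}")
print(f"{lower:.3f} <= Y_h(new) <= {upper:.3f}, coverage = {coverage:.4f}")
```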

The basic idea of a prediction interval is thus to choose a range in the distribution of Y 
wherein most of the observations will fall, and then to declare that the next observation will 
fall in this range. The usefulness of the prediction interval depends, as always, on the width 
of the interval and the needs for precision by the user. 

In general, when the regression parameters of normal error regression model (2.1) are
known, the $1 - \alpha$ prediction limits for $Y_{h(new)}$ are:

$$E\{Y_h\} \pm z(1 - \alpha/2)\sigma \tag{2.34}$$

In centering the limits around $E\{Y_h\}$, we obtain the narrowest interval consistent with the
specified probability of a correct prediction.

Prediction Interval for $Y_{h(new)}$ when Parameters Unknown

When the regression parameters are unknown, they must be estimated. The mean of the
distribution of Y is estimated by $\hat{Y}_h$, as usual, and the variance of the distribution of Y
is estimated by MSE. We cannot, however, simply use the prediction limits (2.34) with
the parameters replaced by the corresponding point estimators. The reason is illustrated
intuitively in Figure 2.5. Shown there are two probability distributions of Y, corresponding to
the upper and lower limits of a confidence interval for $E\{Y_h\}$. In other words, the distribution
of Y could be located as far left as the one shown, as far right as the other one shown, or
anywhere in between. Since we do not know the mean $E\{Y_h\}$ and only estimate it by a
confidence interval, we cannot be certain of the location of the distribution of Y.

Figure 2.5 also shows the prediction limits for each of the two probability distributions
of Y presented there. Since we cannot be certain of the location of the distribution




FIGURE 2.5 Prediction of $Y_{h(new)}$ when Parameters Unknown.



of Y t prediction limits for F ft ( ne w) clearly must take account of two elements，as shown in 
Figure 2.5: 

1. Variation in possible location of the distribution of Y. 

2. Variation within the probability distribution of Y. 

Prediction limits for a new observation $Y_{h(\text{new})}$ at a given level $X_h$ are obtained by means
of the following theorem:

$$\frac{Y_{h(\text{new})} - \hat{Y}_h}{s\{\text{pred}\}} \ \text{is distributed as } t(n-2) \text{ for normal error regression model (2.1)} \qquad (2.35)$$

Note that the studentized statistic (2.35) uses the point estimator $\hat{Y}_h$ in the numerator rather
than the true mean $E\{Y_h\}$ because the true mean is unknown and cannot be used in making a
prediction. The estimated standard deviation of the prediction, $s\{\text{pred}\}$, in the denominator
of the studentized statistic will be defined shortly.

From theorem (2.35), it follows in the usual fashion that the $1-\alpha$ prediction limits for
a new observation $Y_{h(\text{new})}$ are (for instance, compare (2.35) to (2.10) and relate $\hat{Y}_h$ to $b_1$ and
$Y_{h(\text{new})}$ to $\beta_1$):

$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s\{\text{pred}\} \qquad (2.36)$$

Note that the numerator of the studentized statistic (2.35) represents how far the new
observation $Y_{h(\text{new})}$ will deviate from the estimated mean $\hat{Y}_h$ based on the original n cases in
the study. This difference may be viewed as the prediction error, with $\hat{Y}_h$ serving as the best
point estimate of the value of the new observation $Y_{h(\text{new})}$. The variance of this prediction
error can be readily obtained by utilizing the independence of the new observation $Y_{h(\text{new})}$ and
the original n sample cases on which $\hat{Y}_h$ is based. We denote the variance of the prediction
error by $\sigma^2\{\text{pred}\}$, and we obtain by (A.31b):

$$\sigma^2\{\text{pred}\} = \sigma^2\{Y_{h(\text{new})} - \hat{Y}_h\} = \sigma^2\{Y_{h(\text{new})}\} + \sigma^2\{\hat{Y}_h\} = \sigma^2 + \sigma^2\{\hat{Y}_h\} \qquad (2.37)$$

Note that $\sigma^2\{\text{pred}\}$ has two components:

1. The variance of the distribution of Y at $X = X_h$, namely $\sigma^2$.

2. The variance of the sampling distribution of $\hat{Y}_h$, namely $\sigma^2\{\hat{Y}_h\}$.



An unbiased estimator of $\sigma^2\{\text{pred}\}$ is:

$$s^2\{\text{pred}\} = MSE + s^2\{\hat{Y}_h\} \qquad (2.38)$$

which can be expressed as follows, using (2.30):

$$s^2\{\text{pred}\} = MSE\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right] \qquad (2.38a)$$

The Toluca Company studied the relationship between lot size and work hours primarily
to obtain information on the mean work hours required for different lot sizes for use in
determining the optimum lot size. The company was also interested, however, to see whether
the regression relationship is useful for predicting the required work hours for individual
lots. Suppose that the next lot to be produced consists of $X_h = 100$ units and that a 90 percent
prediction interval is desired. We require $t(.95; 23) = 1.714$. From earlier work, we have:

$$\hat{Y}_h = 419.4 \qquad s^2\{\hat{Y}_h\} = 203.72 \qquad MSE = 2{,}384$$

Using (2.38), we obtain:

$$s^2\{\text{pred}\} = 2{,}384 + 203.72 = 2{,}587.72$$
$$s\{\text{pred}\} = 50.87$$

Hence, the 90 percent prediction interval for $Y_{h(\text{new})}$ is by (2.36):

$$419.4 - 1.714(50.87) \le Y_{h(\text{new})} \le 419.4 + 1.714(50.87)$$
$$332.2 \le Y_{h(\text{new})} \le 506.6$$

With confidence coefficient .90, we predict that the number of work hours for the next
production run of 100 units will be somewhere between 332 and 507 hours.
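
The arithmetic of this example can be retraced with a short script. This is only a sketch: the variable names are illustrative, and the t multiple 1.714 is copied from the text rather than computed from a distribution function.

```python
import math

# Values quoted in the Toluca Company example.
y_hat_h = 419.4        # fitted value at X_h = 100
s2_y_hat_h = 203.72    # estimated variance of the fitted value, s^2{Y_hat_h}
mse = 2384.0           # error mean square
t_mult = 1.714         # t(.95; 23), taken from the text (assumption: not recomputed here)

# (2.38): estimated variance of the prediction error.
s2_pred = mse + s2_y_hat_h
s_pred = math.sqrt(s2_pred)

# (2.36): 90 percent prediction limits for a single new observation.
lower = y_hat_h - t_mult * s_pred
upper = y_hat_h + t_mult * s_pred

print(f"s2_pred = {s2_pred:.2f}, s_pred = {s_pred:.2f}")
print(f"90 percent prediction interval: {lower:.1f} to {upper:.1f}")
```

Changing `t_mult` to another tabled percentile gives the limits for a different confidence coefficient.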

This prediction interval is rather wide and may not be too useful for planning worker 
requirements for the next lot. The interval can still be useful for control purposes, though. 
For instance, suppose that the actual work hours on the next lot of 100 units were 550 hours. 
Since the actual work hours fall outside the prediction limits, management would have an 
indication that a change in the production process may have occurred and would be alerted 
to the possible need for remedial action. 

Note that the primary reason for the wide prediction interval is the large lot-to-lot
variability in work hours for any given lot size; MSE = 2,384 accounts for 92 percent of
the estimated prediction variance $s^2\{\text{pred}\} = 2{,}587.72$. It may be that the large lot-to-lot
variability reflects other factors that affect the required number of work hours besides lot
size, such as the amount of experience of employees assigned to the lot production. If so, a
multiple regression model incorporating these other factors might lead to much more
precise predictions. Alternatively, a designed experiment could be conducted to determine the
main factors leading to the large lot-to-lot variation. A quality improvement program would
then use these findings to achieve more uniform performance, for example, by additional
training of employees if inadequate training accounted for much of the variability.






Comments 

1. The 90 percent prediction interval for $Y_{h(\text{new})}$ obtained in the Toluca Company example is wider
than the 90 percent confidence interval for $E\{Y_h\}$ obtained in Example 2 on page 55. The reason is
that when predicting the work hours required for a new lot, we encounter both the variability in $\hat{Y}_h$
from sample to sample as well as the lot-to-lot variation within the probability distribution of Y.

2. Formula (2.38a) indicates that the prediction interval is wider the further $X_h$ is from $\bar{X}$. The
reason for this is that the estimate of the mean $\hat{Y}_h$, as noted earlier, is less precise as $X_h$ is located
farther away from $\bar{X}$.

3. The prediction limits (2.36), unlike the confidence limits (2.33) for a mean response $E\{Y_h\}$,
are sensitive to departures from normality of the error terms distribution. In Chapter 3, we discuss
diagnostic procedures for examining the nature of the probability distribution of the error terms, and
we describe remedial measures if the departure from normality is serious.

4. The confidence coefficient for the prediction limits (2.36) refers to the taking of repeated
samples based on the same set of X values, and calculating prediction limits for $Y_{h(\text{new})}$ for each
sample.

5. Prediction limits (2.36) apply for a single prediction based on the sample data. Next, we discuss
how to predict the mean of several new observations at a given $X_h$, and in Chapter 4 we take up how
to make several predictions at different $X_h$ levels.

6. Prediction intervals resemble confidence intervals. However, they differ conceptually. A
confidence interval represents an inference on a parameter and is an interval that is intended to cover the
value of the parameter. A prediction interval, on the other hand, is a statement about the value to be
taken by a random variable, the new observation $Y_{h(\text{new})}$. ■


Prediction of Mean of m New Observations for Given $X_h$

Occasionally, one would like to predict the mean of m new observations on Y for a given
level of the predictor variable. Suppose the Toluca Company has been asked to bid on a
contract that calls for $m = 3$ production runs of $X_h = 100$ units during the next few months.
Management would like to predict the mean work hours per lot for these three runs and
then convert this into a prediction of the total work hours required to fill the contract.

We denote the mean of the new Y observations to be predicted as $\bar{Y}_{h(\text{new})}$. It can be shown
that the appropriate $1-\alpha$ prediction limits are, assuming that the new Y observations are
independent:

$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s\{\text{predmean}\} \qquad (2.39)$$

where:

$$s^2\{\text{predmean}\} = \frac{MSE}{m} + s^2\{\hat{Y}_h\} \qquad (2.39a)$$

or equivalently:

$$s^2\{\text{predmean}\} = MSE\left[\frac{1}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right] \qquad (2.39b)$$

Note from (2.39a) that the variance $s^2\{\text{predmean}\}$ has two components:

1. The variance of the mean of m observations from the probability distribution of Y at
$X = X_h$.

2. The variance of the sampling distribution of $\hat{Y}_h$.





Example 


In the Toluca Company example, let us find the 90 percent prediction interval for the mean
number of work hours $\bar{Y}_{h(\text{new})}$ in three new production runs, each for $X_h = 100$ units. From
previous work, we have:

$$\hat{Y}_h = 419.4 \qquad s^2\{\hat{Y}_h\} = 203.72$$
$$MSE = 2{,}384 \qquad t(.95; 23) = 1.714$$

Hence, we obtain:

$$s^2\{\text{predmean}\} = \frac{2{,}384}{3} + 203.72 = 998.4$$
$$s\{\text{predmean}\} = 31.60$$

The prediction interval for the mean work hours per lot then is:

$$419.4 - 1.714(31.60) \le \bar{Y}_{h(\text{new})} \le 419.4 + 1.714(31.60)$$
$$365.2 \le \bar{Y}_{h(\text{new})} \le 473.6$$

Note that these prediction limits are narrower than those for predicting the work hours
for a single lot of 100 units because they involve a prediction of the mean work hours for
three lots.

We obtain the prediction interval for the total number of work hours for the three lots by
multiplying the prediction limits for $\bar{Y}_{h(\text{new})}$ by 3:

$$1{,}095.6 = 3(365.2) \le \text{Total work hours} \le 3(473.6) = 1{,}420.8$$

Thus, it can be predicted with 90 percent confidence that between 1,096 and 1,421 work
hours will be needed to fill the contract for three lots of 100 units each.
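
The same calculation can be sketched in code, using (2.39a) with m = 3. The names are illustrative and the t multiple is again taken from the text. Because the script carries unrounded limits through to the total, its total limits differ from the text's 1,095.6 and 1,420.8 by about 0.1, since the text multiplies the rounded limits 365.2 and 473.6.

```python
import math

# Values quoted in the example.
y_hat_h = 419.4
s2_y_hat_h = 203.72
mse = 2384.0
t_mult = 1.714   # t(.95; 23), taken from the text
m = 3            # number of new production runs

# (2.39a): estimated variance for predicting the mean of m new observations.
s2_predmean = mse / m + s2_y_hat_h
s_predmean = math.sqrt(s2_predmean)

# (2.39): limits for the mean work hours per lot.
lower_mean = y_hat_h - t_mult * s_predmean
upper_mean = y_hat_h + t_mult * s_predmean

# Limits for the total work hours over the m lots.
lower_total = m * lower_mean
upper_total = m * upper_mean

print(f"s2_predmean = {s2_predmean:.1f}, s_predmean = {s_predmean:.2f}")
print(f"mean per lot: {lower_mean:.1f} to {upper_mean:.1f}")
print(f"total hours:  {lower_total:.1f} to {upper_total:.1f}")
```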

Comment 

The 90 percent prediction interval for $\bar{Y}_{h(\text{new})}$, obtained for the Toluca Company example above, is
narrower than that obtained for $Y_{h(\text{new})}$ on page 59, as expected. Furthermore, both of the prediction
intervals are wider than the 90 percent confidence interval for $E\{Y_h\}$ obtained in Example 2 on page 55,
also as expected. ■


2.6 Confidence Band for Regression Line 


At times we would like to obtain a confidence band for the entire regression line
$E\{Y\} = \beta_0 + \beta_1 X$. This band enables us to see the region in which the entire regression line lies. It
is particularly useful for determining the appropriateness of a fitted regression function, as
we explain in Chapter 3.

The Working-Hotelling $1-\alpha$ confidence band for the regression line for regression
model (2.1) has the following two boundary values at any level $X_h$:

$$\hat{Y}_h \pm W s\{\hat{Y}_h\} \qquad (2.40)$$

where:

$$W^2 = 2F(1-\alpha;\, 2,\, n-2) \qquad (2.40a)$$

and $\hat{Y}_h$ and $s\{\hat{Y}_h\}$ are defined in (2.28) and (2.30), respectively. Note that the formula
for the boundary values is of exactly the same form as formula (2.33) for the confidence
limits for the mean response at $X_h$, except that the t multiple has been replaced by the W
multiple. Consequently, the boundary points of the confidence band for the regression line
are wider apart the further $X_h$ is from the mean $\bar{X}$ of the X observations. The W multiple
will be larger than the t multiple in (2.33) because the confidence band must encompass
the entire regression line, whereas the confidence limits for $E\{Y_h\}$ at $X_h$ apply only at the
single level $X_h$.

FIGURE 2.6 Confidence Band for Regression Line: Toluca Company Example (horizontal axis: Lot Size X; vertical axis: Hours Y).

Example

We wish to determine how precisely we have been able to estimate the regression function
for the Toluca Company example by obtaining the 90 percent confidence band for the
regression line. We illustrate the calculations of the boundary values of the confidence band
when $X_h = 100$. We found earlier for this case:

$$\hat{Y}_h = 419.4 \qquad s\{\hat{Y}_h\} = 14.27$$

We now require:

$$W^2 = 2F(1-\alpha;\, 2,\, n-2) = 2F(.90;\, 2,\, 23) = 2(2.549) = 5.098$$
$$W = 2.258$$

Hence, the boundary values of the confidence band for the regression line at $X_h = 100$ are
$419.4 \pm 2.258(14.27)$, and the confidence band there is:

$$387.2 \le \beta_0 + \beta_1 X_h \le 451.6 \quad \text{for } X_h = 100$$

In similar fashion, we can calculate the boundary values for other values of $X_h$ by
obtaining $\hat{Y}_h$ and $s\{\hat{Y}_h\}$ for each $X_h$ level from (2.28) and (2.30) and then finding the
boundary values by means of (2.40). Figure 2.6 contains a plot of the confidence band for
the regression line. Note that at $X_h = 100$, the boundary values are 387.2 and 451.6, as we
calculated earlier.

We see from Figure 2.6 that the regression line for the Toluca Company example has
been estimated fairly precisely. The slope of the regression line is clearly positive, and the
levels of the regression line at different levels of X are estimated fairly precisely except for
small and large lot sizes.
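
The boundary values at $X_h = 100$ can be checked numerically. The sketch below uses illustrative names and takes the percentile F(.90; 2, 23) = 2.549 and the t multiple 1.714 from the text rather than computing them. It also prints the pointwise confidence limits from the t multiple so the two widths can be compared.

```python
import math

# Values quoted in the example for X_h = 100.
y_hat_h = 419.4
s_y_hat_h = 14.27
f_percentile = 2.549   # F(.90; 2, 23), taken from the text
t_mult = 1.714         # t(.95; 23), taken from the text

# (2.40a): Working-Hotelling multiple.
w = math.sqrt(2.0 * f_percentile)

# (2.40): boundary values of the confidence band at X_h = 100.
band_lower = y_hat_h - w * s_y_hat_h
band_upper = y_hat_h + w * s_y_hat_h

# Pointwise limits for the mean response at the single level X_h, for comparison.
ci_lower = y_hat_h - t_mult * s_y_hat_h
ci_upper = y_hat_h + t_mult * s_y_hat_h

print(f"W = {w:.3f}")
print(f"band at X_h = 100:      {band_lower:.1f} to {band_upper:.1f}")
print(f"pointwise CI at X_h = 100: {ci_lower:.1f} to {ci_upper:.1f}")
```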





Comments 

1. The boundary values of the confidence band for the regression line in (2.40) define a hyperbola,
as may be seen by replacing $\hat{Y}_h$ and $s\{\hat{Y}_h\}$ by their definitions in (2.28) and (2.30), respectively:

$$b_0 + b_1 X \pm W\sqrt{MSE}\left[\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]^{1/2} \qquad (2.41)$$

2. The boundary values of the confidence band for the regression line at any value $X_h$ often are
not substantially wider than the confidence limits for the mean response at that single $X_h$ level. In
the Toluca Company example, the t multiple for estimating the mean response at $X_h = 100$ with a
90 percent confidence interval was $t(.95; 23) = 1.714$. This compares with the W multiple for the
90 percent confidence band for the entire regression line of $W = 2.258$. With the somewhat wider
limits for the entire regression line, one is able to draw conclusions about any and all mean responses
for the entire regression line and not just about the mean response at a given X level. Some uses of
this broader base for inference will be explained in the next two chapters.

3. The confidence band (2.40) applies to the entire regression line over all real-numbered values
of X from $-\infty$ to $\infty$. The confidence coefficient indicates the proportion of time that the estimating
procedure will yield a band that covers the entire line, in a long series of samples in which the X
observations are kept at the same level as in the actual study.

In applications, the confidence band is ignored for that part of the regression line which is not
of interest in the problem at hand. In the Toluca Company example, for instance, negative lot sizes
would be ignored. The confidence coefficient for a limited segment of the band of interest is somewhat
higher than $1-\alpha$, so $1-\alpha$ serves then as a lower bound to the confidence coefficient.

4. Some alternative procedures for developing confidence bands for the regression line have been
developed. The simplicity of the Working-Hotelling confidence band (2.40) arises from the fact that
it is a direct extension of the confidence limits for a single mean response in (2.33). ■


2.7 Analysis of Variance Approach to Regression Analysis

We now have developed the basic regression model and demonstrated its major uses. At 
this point, we consider the regression analysis from the perspective of analysis of variance. 
This new perspective will not enable us to do anything new, but the analysis of variance 
approach will come into its own when we take up multiple regression models and other 
types of linear statistical models. 

Partitioning of Total Sum of Squares 

Basic Notions. The analysis of variance approach is based on the partitioning of sums
of squares and degrees of freedom associated with the response variable Y. To explain the
motivation of this approach, consider again the Toluca Company example. Figure 2.7a shows
the observations $Y_i$ for the first two production runs presented in Table 1.1. Disregarding
the lot sizes, we see that there is variation in the number of work hours $Y_i$, as in all statistical
data. This variation is conventionally measured in terms of the deviations of the $Y_i$ around
their mean $\bar{Y}$:

$$Y_i - \bar{Y} \qquad (2.42)$$



FIGURE 2.7 Illustration of Partitioning of Total Deviations $Y_i - \bar{Y}$: Toluca Company Example (not drawn to
scale; only observations $Y_1$ and $Y_2$ are shown). Panels: (a) Total Deviations $Y_i - \bar{Y}$; (b) Deviations $Y_i - \hat{Y}_i$;
(c) Deviations $\hat{Y}_i - \bar{Y}$. Horizontal axis in each panel: Lot Size.

These deviations are shown by the vertical lines in Figure 2.7a. The measure of total
variation, denoted by SSTO, is the sum of the squared deviations (2.42):

$$SSTO = \sum (Y_i - \bar{Y})^2 \qquad (2.43)$$

Here SSTO stands for total sum of squares. If all $Y_i$ observations are the same, SSTO = 0.
The greater the variation among the $Y_i$ observations, the larger is SSTO. Thus, SSTO for
our example is a measure of the uncertainty pertaining to the work hours required for a lot,
when the lot size is not taken into account.

When we utilize the predictor variable X, the variation reflecting the uncertainty
concerning the variable Y is that of the $Y_i$ observations around the fitted regression line:

$$Y_i - \hat{Y}_i \qquad (2.44)$$

These deviations are shown by the vertical lines in Figure 2.7b. The measure of variation
in the $Y_i$ observations that is present when the predictor variable X is taken into account is
the sum of the squared deviations (2.44), which is the familiar SSE of (1.21):

$$SSE = \sum (Y_i - \hat{Y}_i)^2 \qquad (2.45)$$

Again, SSE denotes error sum of squares. If all $Y_i$ observations fall on the fitted regression
line, SSE = 0. The greater the variation of the $Y_i$ observations around the fitted regression
line, the larger is SSE.

For the Toluca Company example, we know from earlier work (Table 2.1) that:

$$SSTO = 307{,}203 \qquad SSE = 54{,}825$$

What accounts for the substantial difference between these two sums of squares? The
difference, as we show shortly, is another sum of squares:

$$SSR = \sum (\hat{Y}_i - \bar{Y})^2 \qquad (2.46)$$






where SSR stands for regression sum of squares. Note that SSR is a sum of squared deviations,
the deviations being:

$$\hat{Y}_i - \bar{Y} \qquad (2.47)$$

These deviations are shown by the vertical lines in Figure 2.7c. Each deviation is simply the
difference between the fitted value on the regression line and the mean of the fitted values
$\bar{Y}$. (Recall from (1.18) that the mean of the fitted values $\hat{Y}_i$ is $\bar{Y}$.) If the regression line is
horizontal so that $\hat{Y}_i - \bar{Y} \equiv 0$, then SSR = 0. Otherwise, SSR is positive.

SSR may be considered a measure of that part of the variability of the $Y_i$ which is
associated with the regression line. The larger SSR is in relation to SSTO, the greater is the
effect of the regression relation in accounting for the total variation in the $Y_i$ observations.
For the Toluca Company example, we have:

$$SSR = SSTO - SSE = 307{,}203 - 54{,}825 = 252{,}378$$

which indicates that most of the total variability in work hours is accounted for by the
relation between lot size and work hours.

Formal Development of Partitioning. The total deviation $Y_i - \bar{Y}$, used in the measure of
the total variation of the observations $Y_i$ without taking the predictor variable into account,
can be decomposed into two components:

$$\underbrace{Y_i - \bar{Y}}_{\text{Total deviation}} = \underbrace{\hat{Y}_i - \bar{Y}}_{\text{Deviation of fitted regression value around mean}} + \underbrace{Y_i - \hat{Y}_i}_{\text{Deviation around fitted regression line}} \qquad (2.48)$$

The two components are:

1. The deviation of the fitted value $\hat{Y}_i$ around the mean $\bar{Y}$.

2. The deviation of the observation $Y_i$ around the fitted regression line.

Figure 2.7 shows this decomposition for observation $Y_1$ by the broken lines.

It is a remarkable property that the sums of these squared deviations have the same
relationship:

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 \qquad (2.49)$$

or, using the notation in (2.43), (2.45), and (2.46):

$$SSTO = SSR + SSE \qquad (2.50)$$

To prove this basic result in the analysis of variance, we proceed as follows:

$$\sum (Y_i - \bar{Y})^2 = \sum [(\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)]^2$$
$$= \sum [(\hat{Y}_i - \bar{Y})^2 + (Y_i - \hat{Y}_i)^2 + 2(\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)]$$
$$= \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 + 2\sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)$$

The last term on the right equals zero, as we can see by expanding it:

$$2\sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) = 2\sum \hat{Y}_i (Y_i - \hat{Y}_i) - 2\bar{Y} \sum (Y_i - \hat{Y}_i)$$

The first summation on the right equals zero by (1.20), and the second equals zero by (1.17).
Hence, (2.49) follows.
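
The partitioning can be verified numerically on any least squares fit. The sketch below uses a small made-up data set (hypothetical values, not the Toluca data) and checks that the cross-product term vanishes and that SSTO = SSR + SSE up to floating-point rounding.

```python
# Hypothetical data for illustration only; these are not the Toluca observations.
x = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
y = [110.0, 135.0, 160.0, 210.0, 220.0, 275.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates of slope and intercept.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares (2.43), (2.46), (2.45).
ssto = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Cross-product term from the proof; it should be zero up to rounding.
cross = sum((yh - y_bar) * (yi - yh) for yi, yh in zip(y, y_hat))

print(f"SSTO = {ssto:.4f}, SSR + SSE = {ssr + sse:.4f}, cross term = {cross:.2e}")
```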


Comment 

The formulas for SSTO, SSR, and SSE given in (2.43), (2.45), and (2.46) are best for computational
accuracy. Alternative formulas that are algebraically equivalent are available. One that is useful for
deriving analytical results is:

$$SSR = b_1^2 \sum (X_i - \bar{X})^2 \qquad (2.51)$$


Breakdown of Degrees of Freedom 

Corresponding to the partitioning of the total sum of squares SSTO, there is a partitioning
of the associated degrees of freedom (abbreviated df). We have $n-1$ degrees of freedom
associated with SSTO. One degree of freedom is lost because the deviations $Y_i - \bar{Y}$ are
subject to one constraint: they must sum to zero. Equivalently, one degree of freedom is
lost because the sample mean $\bar{Y}$ is used to estimate the population mean.

SSE, as noted earlier, has $n-2$ degrees of freedom associated with it. Two degrees of
freedom are lost because the two parameters $\beta_0$ and $\beta_1$ are estimated in obtaining the fitted
values $\hat{Y}_i$.

SSR has one degree of freedom associated with it. Although there are n deviations $\hat{Y}_i - \bar{Y}$,
all fitted values $\hat{Y}_i$ are calculated from the same estimated regression line. Two degrees of
freedom are associated with a regression line, corresponding to the intercept and the slope
of the line. One of the two degrees of freedom is lost because the deviations $\hat{Y}_i - \bar{Y}$ are
subject to a constraint: they must sum to zero.

Note that the degrees of freedom are additive:

$$n - 1 = 1 + (n - 2)$$

For the Toluca Company example, these degrees of freedom are:

$$24 = 1 + 23$$


Mean Squares 

A sum of squares divided by its associated degrees of freedom is called a mean square
(abbreviated MS). For instance, an ordinary sample variance is a mean square since a sum
of squares, $\sum (Y_i - \bar{Y})^2$, is divided by its associated degrees of freedom, $n-1$. We are
interested here in the regression mean square, denoted by MSR:

$$MSR = \frac{SSR}{1} = SSR \qquad (2.52)$$

and in the error mean square, MSE, defined earlier in (1.22):

$$MSE = \frac{SSE}{n-2} \qquad (2.53)$$

For the Toluca Company example, we have SSR = 252,378 and SSE = 54,825. Hence:

$$MSR = \frac{252{,}378}{1} = 252{,}378$$

Also, we obtained earlier:

$$MSE = \frac{54{,}825}{23} = 2{,}384$$

Comment 

The two mean squares MSR and MSE do not add to

$$\frac{SSTO}{n-1} = \frac{307{,}203}{24} = 12{,}800$$

Thus, mean squares are not additive. ■
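
A few lines of arithmetic reproduce the Toluca mean squares and confirm that they are not additive, even though the sums of squares and the degrees of freedom are. The sketch uses only the figures quoted above; the variable names are illustrative.

```python
# Sums of squares and sample size quoted for the Toluca Company example.
ssto = 307203.0
sse = 54825.0
n = 25

ssr = ssto - sse                 # (2.50) rearranged

df_regression = 1
df_error = n - 2
df_total = n - 1

msr = ssr / df_regression        # (2.52)
mse = sse / df_error             # (2.53)
total_mean_square = ssto / df_total

print(f"SSR = {ssr:.0f}, MSR = {msr:.0f}, MSE = {mse:.0f}")
print(f"SSTO/(n-1) = {total_mean_square:.0f}, MSR + MSE = {msr + mse:.0f}")
```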

Analysis of Variance Table 

Basic Table. The breakdowns of the total sum of squares and associated degrees of 
freedom are displayed in the form of an analysis of variance table (ANOVA table) in 
Table 2.2. Mean squares of interest also are shown. In addition, the ANOVA table contains 
a column of expected mean squares that will be utilized shortly. The ANOVA table for the 
Toluca Company example is shown in Figure 2.2. The columns for degrees of freedom and 
sums of squares are reversed in the MINITAB output.

Modified Table. Sometimes an ANOVA table showing one additional element of
decomposition is utilized. This modified table is based on the fact that the total sum of squares
can be decomposed into two parts, as follows:

$$SSTO = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2$$

In the modified ANOVA table, the total uncorrected sum of squares, denoted by SSTOU,
is defined as:

$$SSTOU = \sum Y_i^2 \qquad (2.54)$$

and the correction for the mean sum of squares, denoted by SS(correction for mean), is
defined as:

$$SS(\text{correction for mean}) = n\bar{Y}^2 \qquad (2.55)$$

Table 2.3 shows the general format of this modified ANOVA table. While both types of
ANOVA tables are widely used, we shall usually utilize the basic type of table.


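TABLE 2.2 ANOVA Table for Simple Linear Regression.

| Source of Variation | SS | df | MS | E{MS} |
|---|---|---|---|---|
| Regression | $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ | $1$ | $MSR = \dfrac{SSR}{1}$ | $\sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$ |
| Error | $SSE = \sum (Y_i - \hat{Y}_i)^2$ | $n-2$ | $MSE = \dfrac{SSE}{n-2}$ | $\sigma^2$ |
| Total | $SSTO = \sum (Y_i - \bar{Y})^2$ | $n-1$ | | |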



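TABLE 2.3 Modified ANOVA Table for Simple Linear Regression.

| Source of Variation | SS | df | MS |
|---|---|---|---|
| Regression | $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ | $1$ | $MSR = \dfrac{SSR}{1}$ |
| Error | $SSE = \sum (Y_i - \hat{Y}_i)^2$ | $n-2$ | $MSE = \dfrac{SSE}{n-2}$ |
| Total | $SSTO = \sum (Y_i - \bar{Y})^2$ | $n-1$ | |
| Correction for mean | $SS(\text{correction for mean}) = n\bar{Y}^2$ | $1$ | |
| Total, uncorrected | $SSTOU = \sum Y_i^2$ | $n$ | |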


Expected Mean Squares 

In order to make inferences based on the analysis of variance approach, we need to know
the expected value of each of the mean squares. The expected value of a mean square is the
mean of its sampling distribution and tells us what is being estimated by the mean square.
Statistical theory provides the following results:

$$E\{MSE\} = \sigma^2 \qquad (2.56)$$

$$E\{MSR\} = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2 \qquad (2.57)$$

The expected mean squares in (2.56) and (2.57) are shown in the analysis of variance table
in Table 2.2. Note that result (2.56) is in accord with our earlier statement that MSE is an
unbiased estimator of $\sigma^2$.

Two important implications of the expected mean squares in (2.56) and (2.57) are the
following:

1. The mean of the sampling distribution of MSE is $\sigma^2$ whether or not X and Y are linearly
related, i.e., whether or not $\beta_1 = 0$.

2. The mean of the sampling distribution of MSR is also $\sigma^2$ when $\beta_1 = 0$. Hence, when
$\beta_1 = 0$, the sampling distributions of MSR and MSE are located identically and MSR and
MSE will tend to be of the same order of magnitude.

On the other hand, when $\beta_1 \neq 0$, the mean of the sampling distribution of MSR is
greater than $\sigma^2$ since the term $\beta_1^2 \sum (X_i - \bar{X})^2$ in (2.57) then must be positive. Thus,
when $\beta_1 \neq 0$, the mean of the sampling distribution of MSR is located to the right of that
of MSE and, hence, MSR will tend to be larger than MSE.

This suggests that a comparison of MSR and MSE is useful for testing whether or not
$\beta_1 = 0$. If MSR and MSE are of the same order of magnitude, this would suggest that $\beta_1 = 0$.
On the other hand, if MSR is substantially greater than MSE, this would suggest that $\beta_1 \neq 0$.
This indeed is the basic idea underlying the analysis of variance test to be discussed next.
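
Results (2.56) and (2.57) can be illustrated by simulation. The sketch below is not part of the text's development: the X levels, the parameter values, the number of replications, and the random seeds are all arbitrary choices made for illustration. It repeatedly generates data from the normal error regression model with fixed X levels and averages MSR and MSE over the replications.

```python
import random

def anova_mean_squares(x, y):
    """Return (MSR, MSE) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = b1 ** 2 * sxx            # formula (2.51)
    return ssr / 1, sse / (n - 2)

def average_mean_squares(beta0, beta1, sigma, x, reps, seed):
    """Average MSR and MSE over repeated samples with the X levels held fixed."""
    rng = random.Random(seed)
    msr_sum = 0.0
    mse_sum = 0.0
    for _ in range(reps):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        msr, mse = anova_mean_squares(x, y)
        msr_sum += msr
        mse_sum += mse
    return msr_sum / reps, mse_sum / reps

# Arbitrary illustrative settings (assumptions, not from the text).
x = [float(v) for v in range(1, 11)]
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sigma = 1.0

avg_msr_null, avg_mse_null = average_mean_squares(2.0, 0.0, sigma, x, 4000, seed=1)
avg_msr_alt, avg_mse_alt = average_mean_squares(2.0, 0.5, sigma, x, 4000, seed=2)

print("beta1 = 0:   avg MSR, avg MSE =", avg_msr_null, avg_mse_null)
print("beta1 = 0.5: avg MSR, avg MSE =", avg_msr_alt, avg_mse_alt)
print("theoretical E{MSR} when beta1 = 0.5:", sigma ** 2 + 0.5 ** 2 * sxx)
```

With the slope set to zero, both averages should settle near $\sigma^2$; with a nonzero slope, the average MSR should settle near $\sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$ while the average MSE stays near $\sigma^2$.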

Comment 

The derivation of (2.56) follows from theorem (2.11), which states that $SSE/\sigma^2 \sim \chi^2(n-2)$
for regression model (2.1). Hence, it follows from property (A.42) of the chi-square distribution
that:

$$E\left\{\frac{SSE}{\sigma^2}\right\} = n - 2$$

or that:

$$E\left\{\frac{SSE}{n-2}\right\} = E\{MSE\} = \sigma^2$$

To find the expected value of MSR, we begin with (2.51):

$$SSR = b_1^2 \sum (X_i - \bar{X})^2$$

Now by (A.15a), we have:

$$\sigma^2\{b_1\} = E\{b_1^2\} - (E\{b_1\})^2 \qquad (2.58)$$

We know from (2.3a) that $E\{b_1\} = \beta_1$ and from (2.3b) that:

$$\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$

Hence, substituting into (2.58), we obtain:

$$E\{b_1^2\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} + \beta_1^2$$

It now follows that:

$$E\{SSR\} = E\{b_1^2\} \sum (X_i - \bar{X})^2 = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$$

Finally, $E\{MSR\}$ is:

$$E\{MSR\} = E\left\{\frac{SSR}{1}\right\} = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$$

F Test of $\beta_1 = 0$ versus $\beta_1 \neq 0$

The analysis of variance approach provides us with a battery of highly useful tests for
regression models (and other linear statistical models). For the simple linear regression
case considered here, the analysis of variance provides us with a test for:

$$H_0\colon \beta_1 = 0 \qquad H_a\colon \beta_1 \neq 0 \qquad (2.59)$$

Test Statistic. The test statistic for the analysis of variance approach is denoted by F*.
As just mentioned, it compares MSR and MSE in the following fashion:

$$F^* = \frac{MSR}{MSE} \qquad (2.60)$$

The earlier motivation, based on the expected mean squares in Table 2.2, suggests that large
values of F* support $H_a$ and values of F* near 1 support $H_0$. In other words, the appropriate
test is an upper-tail one.

Sampling Distribution of F*. In order to be able to construct a statistical decision rule
and examine its properties, we need to know the sampling distribution of F*. We begin by
considering the sampling distribution of F* when $H_0$ ($\beta_1 = 0$) holds. Cochran's theorem
will be most helpful in this connection. For our purposes, this theorem can be stated as
follows:

If all n observations $Y_i$ come from the same normal distribution with
mean $\mu$ and variance $\sigma^2$, and SSTO is decomposed into k sums of
squares $SS_r$, each with degrees of freedom $df_r$, then the $SS_r/\sigma^2$ terms
are independent $\chi^2$ variables with $df_r$ degrees of freedom if:

$$\sum_{r=1}^{k} df_r = n - 1 \qquad (2.61)$$

Note from Table 2.2 that we have decomposed SSTO into the two sums of squares SSR
and SSE and that their degrees of freedom are additive. Hence:

If $\beta_1 = 0$ so that all $Y_i$ have the same mean $\mu = \beta_0$ and the same
variance $\sigma^2$, $SSE/\sigma^2$ and $SSR/\sigma^2$ are independent $\chi^2$ variables.

Now consider test statistic F*, which we can write as follows:

$$F^* = \frac{SSR/\sigma^2}{1} \div \frac{SSE/\sigma^2}{n-2} = \frac{MSR}{MSE}$$

But by Cochran's theorem, we have when $H_0$ holds:

$$F^* \sim \frac{\chi^2(1)}{1} \div \frac{\chi^2(n-2)}{n-2} \quad \text{when } H_0 \text{ holds}$$

where the $\chi^2$ variables are independent. Thus, when $H_0$ holds, F* is the ratio of two
independent $\chi^2$ variables, each divided by its degrees of freedom. But this is the definition
of an F random variable in (A.47).

We have thus established that if $H_0$ holds, F* follows the F distribution, specifically the
$F(1, n-2)$ distribution.

When $H_a$ holds, it can be shown that F* follows the noncentral F distribution, a complex
distribution that we need not consider further at this time.


Comment 

Even if $\beta_1 \neq 0$, SSR and SSE are independent and $SSE/\sigma^2 \sim \chi^2$. However, the condition that both
$SSR/\sigma^2$ and $SSE/\sigma^2$ are $\chi^2$ random variables requires $\beta_1 = 0$. ■

Construction of Decision Rule. Since the test is upper-tail and F* is distributed as
$F(1, n-2)$ when $H_0$ holds, the decision rule is as follows when the risk of a Type I error
is to be controlled at $\alpha$:

$$\text{If } F^* \le F(1-\alpha;\, 1,\, n-2), \text{ conclude } H_0$$
$$\text{If } F^* > F(1-\alpha;\, 1,\, n-2), \text{ conclude } H_a \qquad (2.62)$$

where $F(1-\alpha;\, 1,\, n-2)$ is the $(1-\alpha)100$ percentile of the appropriate F distribution.



Example

For the Toluca Company example, we shall repeat the earlier test on $\beta_1$, this time using the
F test. The alternative conclusions are:

$$H_0\colon \beta_1 = 0 \qquad H_a\colon \beta_1 \neq 0$$

As before, let $\alpha = .05$. Since $n = 25$, we require $F(.95; 1, 23) = 4.28$. The decision rule is:

$$\text{If } F^* \le 4.28, \text{ conclude } H_0$$
$$\text{If } F^* > 4.28, \text{ conclude } H_a$$

We have from earlier that MSR = 252,378 and MSE = 2,384. Hence, F* is:

$$F^* = \frac{252{,}378}{2{,}384} = 105.9$$

Since $F^* = 105.9 > 4.28$, we conclude $H_a$, that $\beta_1 \neq 0$, or that there is a linear
association between work hours and lot size. This is the same result as when the t test was
employed, as it must be according to our discussion below.

The MINITAB output in Figure 2.2 on page 46 shows the F* statistic in the column
labeled F. Next to it is shown the P-value, $P\{F(1, 23) > 105.9\}$, namely, 0+, indicating
that the data are not consistent with $\beta_1 = 0$.
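
Decision rule (2.62) for this example amounts to one division and one comparison. The sketch below takes the critical value F(.95; 1, 23) = 4.28 from the text rather than computing it; the names are illustrative.

```python
# Mean squares quoted for the Toluca Company example.
msr = 252378.0
mse = 2384.0
f_critical = 4.28   # F(.95; 1, 23), taken from the text

# (2.60): test statistic.
f_star = msr / mse

# (2.62): upper-tail decision rule.
conclusion = "Ha" if f_star > f_critical else "H0"

print(f"F* = {f_star:.1f}, conclude {conclusion}")
```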


Equivalence of F Test and t Test. For a given $\alpha$ level, the F test of $\beta_1 = 0$ versus $\beta_1 \neq 0$
is equivalent algebraically to the two-tailed t test. To see this, recall from (2.51) that:

$$SSR = b_1^2 \sum (X_i - \bar{X})^2$$

Thus, we can write:

$$F^* = \frac{SSR \div 1}{SSE \div (n-2)} = \frac{b_1^2 \sum (X_i - \bar{X})^2}{MSE}$$

But since $s^2\{b_1\} = MSE / \sum (X_i - \bar{X})^2$, we obtain:

$$F^* = \frac{b_1^2}{s^2\{b_1\}} = \left(\frac{b_1}{s\{b_1\}}\right)^2 = (t^*)^2 \qquad (2.63)$$

The last step follows because the t* statistic for testing whether or not $\beta_1 = 0$ is by (2.17):

$$t^* = \frac{b_1}{s\{b_1\}}$$

In the Toluca Company example, we just calculated that F* = 105.9. From earlier work,
we have t* = 10.29 (see Figure 2.2). We thus see that $(10.29)^2 = 105.9$.

Corresponding to the relation between t* and F*, we have the following relation between
the required percentiles of the t and F distributions for the tests: $[t(1-\alpha/2;\, n-2)]^2 =
F(1-\alpha;\, 1,\, n-2)$. In our tests on $\beta_1$, these percentiles were $[t(.975; 23)]^2 = (2.069)^2 =
4.28 = F(.95; 1, 23)$. Remember that the t test is two-tailed whereas the F test is one-tailed.

Thus, at any given $\alpha$ level, we can use either the t test or the F test for testing $\beta_1 = 0$
versus $\beta_1 \neq 0$. Whenever one test leads to $H_0$, so will the other, and correspondingly for $H_a$.
The t test, however, is more flexible since it can be used for one-sided alternatives involving
$\beta_1 (\le\, \ge)\, 0$ versus $\beta_1 (>\, <)\, 0$, while the F test cannot.
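
Relation (2.63) holds for any least squares fit, which can be confirmed numerically. The sketch below fits a line to a small made-up data set (hypothetical values, not the Toluca data), computes t* and F* separately, and compares them. It also squares the two percentiles quoted in the text.

```python
import math

# Hypothetical data for illustration only; these are not the Toluca observations.
x = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
y = [113.0, 121.0, 160.0, 221.0, 224.0, 268.0, 352.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
ssr = b1 ** 2 * sxx                   # (2.51)

f_star = (ssr / 1) / mse              # (2.60)
s_b1 = math.sqrt(mse / sxx)           # s{b1}, since s^2{b1} = MSE / sum (X_i - X_bar)^2
t_star = b1 / s_b1                    # (2.17)

print("t*^2 equals F*:", math.isclose(t_star ** 2, f_star))

# Figures quoted in the text for the Toluca example.
print(round(10.29 ** 2, 1), round(2.069 ** 2, 2))
```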





2.8 General Linear Test Approach

The analysis of variance test of $\beta_1 = 0$ versus $\beta_1 \neq 0$ is an example of the general test for
a linear statistical model. We now explain this general test approach in terms of the simple
linear regression model. We do so at this time because of the generality of the approach
and the wide use we shall make of it, and because of the simplicity of understanding the
approach in terms of simple linear regression.

The general linear test approach involves three basic steps, which we now describe in 
turn. 


Full Model

We begin with the model considered to be appropriate for the data, which in this context is
called the full or unrestricted model. For the simple linear regression case, the full model is
the normal error regression model (2.1):

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \text{Full model} \qquad (2.64)$$

We fit this full model, either by the method of least squares or by the method of maximum
likelihood, and obtain the error sum of squares. The error sum of squares is the sum of the
squared deviations of each observation $Y_i$ around its estimated expected value. In this
context, we shall denote this sum of squares by SSE(F) to indicate that it is the error sum
of squares for the full model. Here, we have:

$$SSE(F) = \sum [Y_i - (b_0 + b_1 X_i)]^2 = \sum (Y_i - \hat{Y}_i)^2 = SSE \qquad (2.65)$$

Thus, for the full model (2.64), the error sum of squares is simply SSE, which measures the
variability of the observations $Y_i$ around the fitted regression line.


Reduced Model 

Next, we consider $H_0$. In this instance, we have:

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0 \tag{2.66}$$


The model when $H_0$ holds is called the reduced or restricted model. When $\beta_1 = 0$,
model (2.64) reduces to:

$$Y_i = \beta_0 + \varepsilon_i \qquad \text{Reduced model} \tag{2.67}$$

We fit this reduced model, by either the method of least squares or the method of 
maximum likelihood, and obtain the error sum of squares for this reduced model, denoted 
by SSE(R). When we fit the particular reduced model (2.67), it can be shown that the least 
squares and maximum likelihood estimator of $\beta_0$ is $\bar{Y}$. Hence, the estimated expected value
for each observation is $b_0 = \bar{Y}$, and the error sum of squares for this reduced model is:

$$SSE(R) = \sum (Y_i - b_0)^2 = \sum (Y_i - \bar{Y})^2 = SSTO \tag{2.68}$$





Test Statistic 

The logic now is to compare the two error sums of squares SSE(F) and SSE(R). It can be 
shown that SSE(F) never is greater than SSE(R): 

$$SSE(F) \leq SSE(R) \tag{2.69}$$


The reason is that the more parameters are in the model, the better one can fit the data 
and the smaller are the deviations around the fitted regression function. When SSE(F) is 
not much less than SSE(R), using the full model does not account for much more of the 
variability of the $Y_i$ than does the reduced model, in which case the data suggest that the
reduced model is adequate (i.e., that $H_0$ holds). To put this another way, when SSE(F) is
close to SSE(R), the variation of the observations around the fitted regression function for
the full model is almost as great as the variation around the fitted regression function for
the reduced model. In this case, the added parameters in the full model really do not help to
reduce the variation in the $Y_i$ about the fitted regression function. Thus, a small difference
$SSE(R) - SSE(F)$ suggests that $H_0$ holds. On the other hand, a large difference suggests that
$H_a$ holds because the additional parameters in the model do help to reduce substantially the
variation of the observations $Y_i$ around the fitted regression function.


The actual test statistic is a function of $SSE(R) - SSE(F)$, namely:

$$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} \tag{2.70}$$


which follows the F distribution when $H_0$ holds. The degrees of freedom $df_R$ and $df_F$ are
those associated with the reduced and full model error sums of squares, respectively. Large
values of F* lead to $H_a$ because a large difference $SSE(R) - SSE(F)$ suggests that $H_a$ holds.
The decision rule therefore is:

$$\begin{aligned}
&\text{If } F^* \leq F(1 - \alpha;\ df_R - df_F,\ df_F), \text{ conclude } H_0 \\
&\text{If } F^* > F(1 - \alpha;\ df_R - df_F,\ df_F), \text{ conclude } H_a
\end{aligned} \tag{2.71}$$

For testing whether or not $\beta_1 = 0$, we therefore have:

$$SSE(R) = SSTO \qquad SSE(F) = SSE$$

$$df_R = n - 1 \qquad df_F = n - 2$$

so that we obtain when substituting into (2.70):

$$F^* = \frac{SSTO - SSE}{(n - 1) - (n - 2)} \div \frac{SSE}{n - 2} = \frac{SSR}{1} \div \frac{SSE}{n - 2} = \frac{MSR}{MSE}$$

which is identical to the analysis of variance test statistic (2.60).


Summary 

The general linear test approach can be used for highly complex tests of linear statistical 
models, as well as for simple tests. The basic steps in summary form are: 


1. Fit the full model and obtain the error sum of squares SSE(F). 

2. Fit the reduced model under $H_0$ and obtain the error sum of squares SSE(R).

3. Use test statistic (2.70) and decision rule (2.71). 
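The three steps can be sketched in a few lines of Python for the simple linear regression case. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function names and the five data points are hypothetical, and the percentile $F(1 - \alpha; df_R - df_F, df_F)$ needed for decision rule (2.71) must still be taken from an F table or a statistics library.

```python
def error_sum_of_squares(y, fitted):
    """Sum of squared deviations of each observation around its fitted value."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def general_linear_test(x, y):
    """Return SSE(F), SSE(R), and F* of (2.70) for testing beta_1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Step 1: fit the full model Y_i = b0 + b1 X_i by least squares.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse_full = error_sum_of_squares(y, [b0 + b1 * xi for xi in x])
    df_full = n - 2

    # Step 2: fit the reduced model Y_i = b0; the fitted value is Y-bar.
    sse_reduced = error_sum_of_squares(y, [y_bar] * n)
    df_reduced = n - 1

    # Step 3: form test statistic (2.70).
    f_star = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    return sse_full, sse_reduced, f_star


# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
sse_f, sse_r, f_star = general_linear_test(x, y)
print(sse_f, sse_r, f_star)
```

For these made-up data SSE(F) = 2.4 and SSE(R) = SSTO = 6, so F* = 3.6 / 0.8 = 4.5, which would then be compared with the appropriate F percentile on 1 and 3 degrees of freedom.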





2.9 Descriptive Measures of Linear Association between X and Y 

We have discussed the major uses of regression analysis (estimation of parameters and
means and prediction of new observations) without mentioning the “degree of linear
association” between X and Y, or similar terms. The reason is that the usefulness of estimates 
or predictions depends upon the width of the interval and the user’s needs for precision, 
which vary from one application to another. Hence, no single descriptive measure of the 
“degree of linear association” can capture the essential information as to whether a given 
regression relation is useful in any particular application. 

Nevertheless, there are times when the degree of linear association is of interest in its 
own right. We shall now briefly discuss two descriptive measures that are frequently used
in practice to describe the degree of linear association between X and Y.


Coefficient of Determination 

We saw earlier that SSTO measures the variation in the observations $Y_i$, or the uncertainty in
predicting Y, when no account of the predictor variable X is taken. Thus, SSTO is a measure
of the uncertainty in predicting Y when X is not considered. Similarly, SSE measures the
variation in the $Y_i$ when a regression model utilizing the predictor variable X is employed.
A natural measure of the effect of X in reducing the variation in Y, i.e., in reducing the
uncertainty in predicting Y, is to express the reduction in variation ($SSTO - SSE = SSR$)
as a proportion of the total variation:

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \tag{2.72}$$


The measure $R^2$ is called the coefficient of determination. Since $0 \leq SSE \leq SSTO$, it
follows that:

$$0 \leq R^2 \leq 1 \tag{2.72a}$$

We may interpret $R^2$ as the proportionate reduction of total variation associated with
the use of the predictor variable X. Thus, the larger $R^2$ is, the more the total variation of
Y is reduced by introducing the predictor variable X. The limiting values of $R^2$ occur as
follows: 


1. When all observations fall on the fitted regression line, then SSE = 0 and $R^2 = 1$.
This case is shown in Figure 2.8a. Here, the predictor variable X accounts for all variation
in the observations $Y_i$.

2. When the fitted regression line is horizontal so that $b_1 = 0$ and $\hat{Y}_i \equiv \bar{Y}$, then SSE =
SSTO and $R^2 = 0$. This case is shown in Figure 2.8b. Here, there is no linear association
between X and Y in the sample data, and the predictor variable X is of no help in reducing
the variation in the observations $Y_i$ with linear regression.

In practice, $R^2$ is not likely to be 0 or 1 but somewhere between these limits. The closer
it is to 1, the greater is said to be the degree of linear association between X and Y.





FIGURE 2.8 Scatter Plots when $R^2 = 1$ and $R^2 = 0$. Panel (a): $R^2 = 1$. Panel (b): $R^2 = 0$, with fitted line $\hat{Y} = \bar{Y}$. Both panels plot Y against X.


Example 


For the Toluca Company example, we obtained SSTO = 307,203 and SSR = 252,378. 
Hence: 

$$R^2 = \frac{252{,}378}{307{,}203} = .822$$

Thus, the variation in work hours is reduced by 82.2 percent when lot size is considered. 

The MINITAB output in Figure 2.2 shows the coefficient of determination $R^2$ labeled
as R-sq in percent form. The output also shows the coefficient R-sq(adj), which will be 
explained in Chapter 6. 
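The Toluca Company arithmetic can be checked directly from the two equivalent forms in (2.72). The short sketch below uses only the sums of squares quoted in the example; the function name is an illustrative choice, not something defined in the text.

```python
def coefficient_of_determination(ssr, ssto):
    """R^2 = SSR / SSTO, the first form of (2.72)."""
    if ssto <= 0:
        raise ValueError("SSTO must be positive")
    return ssr / ssto


# Sums of squares quoted in the Toluca Company example.
ssto = 307_203.0
ssr = 252_378.0
sse = ssto - ssr  # SSTO - SSE = SSR, so SSE follows by subtraction

r_squared = coefficient_of_determination(ssr, ssto)
r_squared_alt = 1.0 - sse / ssto  # second form of (2.72)

print(round(r_squared, 3), round(r_squared_alt, 3))
```

Both forms give .822 to three decimals, matching the value in the example.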


Limitations of $R^2$

We noted that no single measure will be adequate for describing the usefulness of a regression
model for different applications. Still, the coefficient of determination is widely used.
Unfortunately, it is subject to serious misunderstandings. We consider now three common
misunderstandings:


Misunderstanding 1. A high coefficient of determination indicates that useful
predictions can be made. This is not necessarily correct. In the Toluca Company
example, we saw that the coefficient of determination was high ($R^2 = .82$). Yet the
90 percent prediction interval for the next lot, consisting of 100 units, was wide (332
to 507 hours) and not precise enough to permit management to schedule workers
effectively.

Misunderstanding 2. A high coefficient of determination indicates that the estimated
regression line is a good fit. Again, this is not necessarily correct. Figure 2.9a shows
a scatter plot where the coefficient of determination is high ($R^2 = .69$). Yet a linear
regression function would not be a good fit since the regression relation is curvilinear.

Misunderstanding 3. A coefficient of determination near zero indicates that X and Y
are not related. This also is not necessarily correct. Figure 2.9b shows a scatter plot
where the coefficient of determination between X and Y is $R^2 = .02$. Yet X and Y are
strongly related; however, the relationship between the two variables is curvilinear.





FIGURE 2.9 Illustrations of Two Misunderstandings about Coefficient of Determination. Panel (a): scatter plot with $R^2 = .69$; linear regression is not a good fit. Panel (b): scatter plot with $R^2 = .02$; strong relation between X and Y. Both panels plot Y against X.


Misunderstanding 1 arises because $R^2$ measures only a relative reduction from SSTO
and provides no information about absolute precision for estimating a mean response or
predicting a new observation. Misunderstandings 2 and 3 arise because $R^2$ measures the
degree of linear association between X and Y, whereas the actual regression relation may
be curvilinear.
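Misunderstandings 2 and 3 are easy to reproduce numerically. The sketch below uses two made-up data sets (they are not the data behind Figure 2.9, so the $R^2$ values differ from .69 and .02): in the first, Y is an exact quadratic function of X over a one-sided range, yet a straight line yields a high $R^2$; in the second, Y is an exact quadratic function of X symmetric about the mean of X, yet the fitted line is horizontal and $R^2$ is zero.

```python
def r_squared_linear_fit(x, y):
    """R^2 = 1 - SSE/SSTO for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / ssto


x = [float(v) for v in range(11)]  # hypothetical X levels 0, 1, ..., 10

# Exactly curvilinear but monotone: the straight line still gives a high R^2.
y_monotone = [xi ** 2 for xi in x]
r2_high = r_squared_linear_fit(x, y_monotone)

# Exactly curvilinear and symmetric about X-bar: b1 = 0, so R^2 = 0.
y_symmetric = [(xi - 5.0) ** 2 for xi in x]
r2_zero = r_squared_linear_fit(x, y_symmetric)

print(r2_high, r2_zero)
```

In both cases Y is perfectly determined by X, which is the point: $R^2$ describes only how well a straight line does, not whether a relation exists or what form it takes.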


Coefficient of Correlation 

A measure of linear association between Y and X when both Y and X are random is the
coefficient of correlation. This measure is the signed square root of $R^2$:

$$r = \pm\sqrt{R^2} \tag{2.73}$$

A plus or minus sign is attached to this measure according to whether the slope of the fitted
regression line is positive or negative. Thus, the range of r is: $-1 \leq r \leq 1$.

For the Toluca Company example, we obtained $R^2 = .822$. Treating X as a random variable,
the correlation coefficient here is:

$$r = +\sqrt{.822} = .907$$

The plus sign is affixed since $b_1$ is positive. We take up the topic of correlation analysis in
more detail in Section 2.11. 
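As a small illustration of (2.73), the sign rule can be written with `math.copysign`. The function name is hypothetical, and the slope passed in is only a stand-in: the text states that $b_1$ is positive for the Toluca Company example, so any positive number serves here.

```python
import math


def correlation_from_r_squared(r_squared, b1):
    """Signed square root of R^2; the sign is taken from the fitted slope b1."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), b1)


# R^2 from the Toluca Company example; 1.0 is a placeholder for a positive slope.
r_toluca = correlation_from_r_squared(0.822, 1.0)

# The same R^2 with a negative slope would give the negative root.
r_negative_slope = correlation_from_r_squared(0.822, -1.0)

print(round(r_toluca, 3), round(r_negative_slope, 3))
```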



Comments 

1. The value taken by $R^2$ in a given sample tends to be affected by the spacing of the X observations.
This is implied in (2.72). SSE is not affected systematically by the spacing of the $X_i$ since, for regression
model (2.1), $\sigma^2\{Y_i\} = \sigma^2$ at all X levels. However, the wider the spacing of the $X_i$ in the sample
when $b_1 \neq 0$, the greater will tend to be the spread of the observed $Y_i$ around $\bar{Y}$ and hence the greater
SSTO will be. Consequently, the wider the $X_i$ are spaced, the higher $R^2$ will tend to be.

2. The regression sum of squares SSR is often called the “explained variation” in Y, and the residual
sum of squares SSE is called the “unexplained variation.” The coefficient $R^2$ then is interpreted in terms
of the proportion of the total variation in Y (SSTO) which has been “explained” by X. Unfortunately,
this terminology frequently is taken literally and, hence, misunderstood. Remember that in a regression
model there is no implication that Y necessarily depends on X in a causal or explanatory sense.

3. Regression models do not contain a parameter to be estimated by $R^2$ or r. These are simply
descriptive measures of the degree of linear association between X and Y in the sample observations
that may, or may not, be useful in any instance. ■

2.10 Considerations in Applying Regression Analysis

We have now discussed the major uses of regression analysis: to make inferences about
the regression parameters, to estimate the mean response for a given X, and to predict 
a new observation Y for a given X. It remains to make a few cautionary remarks about 
implementing applications of regression analysis. 

1. Frequently, regression analysis is used to make inferences for the future. For instance,
for planning staffing requirements, a school board may wish to predict future enrollments by 
using a regression model containing several demographic variables as predictor variables. 
In applications of this type, it is important to remember that the validity of the regression 
application depends upon whether basic causal conditions in the period ahead will be similar 
to those in existence during the period upon which the regression analysis is based. This 
caution applies whether mean responses are to be estimated, new observations predicted, 
or regression parameters estimated. 

2. In predicting new observations on Y, the predictor variable X itself often has to be
predicted. For instance, we mentioned earlier the prediction of company sales for next year
from the demographic projection of the number of persons 16 years of age or older next
year. A prediction of company sales under these circumstances is a conditional prediction,
dependent upon the correctness of the population projection. It is easy to forget the
conditional nature of this type of prediction.

3. Another caution deals with inferences pertaining to levels of the predictor variable 
that fall outside the range of observations. Unfortunately, this situation frequently occurs 
in practice. A company that predicts its sales from a regression relation of company sales 
to disposable personal income will often find the level of disposable personal income of 
interest (e.g., for the year ahead) to fall beyond the range of past data. If the X level does 
not fall far beyond this range, one may have reasonable confidence in the application of the 
regression analysis. On the other hand, if the X level falls far beyond the range of past data, 
extreme caution should be exercised since one cannot be sure that the regression function 
that fits the past data is appropriate over the wider range of the predictor variable. 

4. A statistical test that leads to the conclusion that $\beta_1 \neq 0$ does not establish a
cause-and-effect relation between the predictor and response variables. As we noted in Chapter 1,
with nonexperimental data both the X and Y variables may be simultaneously influenced by
other variables not in the regression model. On the other hand, the existence of a regression
relation in controlled experiments is often good evidence of a cause-and-effect relation.

5. We should note again that frequently we wish to estimate several mean responses 
or predict several new observations for different levels of the predictor variable, and that 
special problems arise in this case. The confidence coefficients for the limits (2.33) for 
estimating a mean response and for the prediction limits (2.36) for a new observation apply
only for a single level of X for a given sample. In Chapter 4, we discuss how to make
multiple inferences from a given sample. 

6. Finally, when observations on the predictor variable X are subject to measurement 
errors, the resulting parameter estimates are generally no longer unbiased. In Chapter 4, we 
discuss several ways to handle this situation. 


2.11 Normal Correlation Models 


Distinction between Regression and Correlation Model 

The normal error regression model (2.1), which has been used throughout this chapter 
and which will continue to be used, assumes that the X values are known constants. As a 
consequence of this, the confidence coefficients and risks of errors refer to repeated sampling 
when the X values are kept the same from sample to sample. 

Frequently, it may not be appropriate to consider the X values as known constants. For 
instance, consider regressing daily bathing suit sales by a department store on mean daily 
temperature. Surely, the department store cannot control daily temperatures, so it would not 
be meaningful to think of repeated sampling where the temperature levels are the same from 
sample to sample. As a second example, an analyst may use a correlation model for the two 
variables “height of person” and “weight of person” in a study of a sample of persons, each 
variable being taken as random. The analyst might wish to study the relation between the 
two variables or might be interested in making inferences about weight of a person on the 
basis of the person’s height, in making inferences about height on the basis of weight, or in 
both. 

Other examples where a correlation model, rather than a regression model, may be 
appropriate are: 


1. To study the relation between service station sales of gasoline, and sales of auxiliary 
products. 

2. To study the relation between company net income determined by generally accepted 
accounting principles and net income according to tax regulations. 

3. To study the relation between blood pressure and age in human subjects. 

The correlation model most widely employed is the normal correlation model. We discuss 
it here for the case of two variables. 


Bivariate Normal Distribution 

The normal correlation model for the case of two variables is based on the bivariate normal 
distribution. Let us denote the two variables as $Y_1$ and $Y_2$. (We do not use the notation X and
Y here because both variables play a symmetrical role in correlation analysis.) We say that
$Y_1$ and $Y_2$ are jointly normally distributed if the density function of their joint distribution
is that of the bivariate normal distribution. 







Density Function. The density function of the bivariate normal distribution is as follows: 


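$$f(Y_1, Y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho_{12}^2}} \exp\left\{-\frac{1}{2(1 - \rho_{12}^2)}\left[\left(\frac{Y_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho_{12}\left(\frac{Y_1 - \mu_1}{\sigma_1}\right)\left(\frac{Y_2 - \mu_2}{\sigma_2}\right) + \left(\frac{Y_2 - \mu_2}{\sigma_2}\right)^2\right]\right\} \tag{2.74}$$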


Note that this density function involves five parameters: $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, $\rho_{12}$. We shall explain
the meaning of these parameters shortly. First, let us consider a graphic representation of
the bivariate normal distribution.

Figure 2.10 contains a SYSTAT three-dimensional plot of a bivariate normal probability
distribution. The probability distribution is a surface in three-dimensional space. For every
pair of $(Y_1, Y_2)$ values, the density $f(Y_1, Y_2)$ represents the height of the surface at that
point. The surface is continuous, and probability corresponds to volume under the surface.

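Marginal Distributions. If $Y_1$ and $Y_2$ are jointly normally distributed, it can be shown
that their marginal distributions have the following characteristics:

The marginal distribution of $Y_1$ is normal with mean $\mu_1$
and standard deviation $\sigma_1$:

$$f_1(Y_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{1}{2}\left(\frac{Y_1 - \mu_1}{\sigma_1}\right)^2\right] \tag{2.75a}$$

The marginal distribution of $Y_2$ is normal with mean $\mu_2$
and standard deviation $\sigma_2$:

$$f_2(Y_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{1}{2}\left(\frac{Y_2 - \mu_2}{\sigma_2}\right)^2\right] \tag{2.75b}$$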


Thus, when $Y_1$ and $Y_2$ are jointly normally distributed, each of the two variables by itself
is normally distributed. The converse, however, is not generally true; if $Y_1$ and $Y_2$ are each
normally distributed, they need not be jointly normally distributed in accord with (2.74).
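The link between the joint density (2.74) and the marginal densities (2.75a) and (2.75b) can be checked numerically. The sketch below is illustrative only: the function names, the parameter values, and the crude midpoint-rule integration are all assumptions made for the example. It integrates the joint density over $Y_2$ and compares the result with the normal density of $Y_1$.

```python
import math


def normal_density(y, mu, sigma):
    """Univariate normal density, the form of (2.75a) and (2.75b)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def bivariate_normal_density(y1, y2, mu1, mu2, sigma1, sigma2, rho12):
    """Bivariate normal density (2.74); requires -1 < rho12 < 1."""
    if not -1.0 < rho12 < 1.0:
        raise ValueError("rho12 must lie strictly between -1 and 1")
    z1 = (y1 - mu1) / sigma1
    z2 = (y2 - mu2) / sigma2
    one_minus_rho_sq = 1.0 - rho12 ** 2
    quad = (z1 * z1 - 2.0 * rho12 * z1 * z2 + z2 * z2) / one_minus_rho_sq
    scale = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(one_minus_rho_sq)
    return math.exp(-0.5 * quad) / scale


def marginal_of_y1(y1, mu1, mu2, sigma1, sigma2, rho12, steps=4000):
    """Approximate f1(y1) by a midpoint Riemann sum of (2.74) over y2."""
    lower = mu2 - 10.0 * sigma2
    width = 20.0 * sigma2 / steps
    total = 0.0
    for k in range(steps):
        y2 = lower + (k + 0.5) * width
        total += bivariate_normal_density(y1, y2, mu1, mu2, sigma1, sigma2, rho12)
    return total * width


# Hypothetical parameter values.
params = dict(mu1=10.0, mu2=20.0, sigma1=2.0, sigma2=4.0, rho12=0.6)
approx = marginal_of_y1(11.0, **params)
exact = normal_density(11.0, params["mu1"], params["sigma1"])

# With rho12 = 0 the joint density factors into the two marginal densities.
joint_rho0 = bivariate_normal_density(11.0, 18.0, 10.0, 20.0, 2.0, 4.0, 0.0)
product = normal_density(11.0, 10.0, 2.0) * normal_density(18.0, 20.0, 4.0)
print(approx, exact, joint_rho0, product)
```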


FIGURE 2.10 Example of Bivariate Normal Distribution.






Meaning of Parameters. The five parameters of the bivariate normal density function
(2.74) have the following meaning:

1. $\mu_1$ and $\sigma_1$ are the mean and standard deviation of the marginal distribution of $Y_1$.

2. $\mu_2$ and $\sigma_2$ are the mean and standard deviation of the marginal distribution of $Y_2$.

3. $\rho_{12}$ is the coefficient of correlation between the random variables $Y_1$ and $Y_2$. This
coefficient is denoted by $\rho\{Y_1, Y_2\}$ in Appendix A, using the correlation operator notation,
and defined in (A.25a):

$$\rho_{12} = \rho\{Y_1, Y_2\} = \frac{\sigma_{12}}{\sigma_1\sigma_2} \tag{2.76}$$

Here, $\sigma_1$ and $\sigma_2$, as just mentioned, denote the standard deviations of $Y_1$ and $Y_2$, and $\sigma_{12}$
denotes the covariance $\sigma\{Y_1, Y_2\}$ between $Y_1$ and $Y_2$ as defined in (A.21):

$$\sigma_{12} = \sigma\{Y_1, Y_2\} = E\{(Y_1 - \mu_1)(Y_2 - \mu_2)\} \tag{2.77}$$

Note that $\sigma_{12} = \sigma_{21}$ and $\rho_{12} = \rho_{21}$.

If $Y_1$ and $Y_2$ are independent, $\sigma_{12} = 0$ according to (A.28) so that $\rho_{12} = 0$. If $Y_1$ and
$Y_2$ are positively related (that is, $Y_1$ tends to be large when $Y_2$ is large, or small when
$Y_2$ is small), $\sigma_{12}$ is positive and so is $\rho_{12}$. On the other hand, if $Y_1$ and $Y_2$ are negatively
related (that is, $Y_1$ tends to be large when $Y_2$ is small, or vice versa), $\sigma_{12}$ is negative and so
is $\rho_{12}$. The coefficient of correlation $\rho_{12}$ can take on any value between $-1$ and 1 inclusive.
It assumes 1 if the linear relation between $Y_1$ and $Y_2$ is perfectly positive (direct) and $-1$ if
it is perfectly negative (inverse).


Conditional Inferences 

As noted, one principal use of a bivariate correlation model is to make conditional inferences
regarding one variable, given the other variable. Suppose $Y_1$ represents a service station’s
gasoline sales and $Y_2$ its sales of auxiliary products. We may then wish to predict a service
station’s sales of auxiliary products $Y_2$, given that its gasoline sales are $Y_1 = \$5{,}500$.

Such conditional inferences require the use of conditional probability distributions, which
we discuss next.


Conditional Probability Distribution of $Y_1$. The density function of the conditional
probability distribution of $Y_1$ for any given value of $Y_2$ is denoted by $f(Y_1|Y_2)$ and defined
as follows:

$$f(Y_1|Y_2) = \frac{f(Y_1, Y_2)}{f_2(Y_2)} \tag{2.78}$$

where $f(Y_1, Y_2)$ is the joint density function of $Y_1$ and $Y_2$, and $f_2(Y_2)$ is the marginal density
function of $Y_2$. When $Y_1$ and $Y_2$ are jointly normally distributed according to (2.74) so that
the marginal density function $f_2(Y_2)$ is given by (2.75b), it can be shown that:

The conditional probability distribution of $Y_1$ for any given
value of $Y_2$ is normal with mean $\alpha_{1|2} + \beta_{12}Y_2$ and standard
deviation $\sigma_{1|2}$ and its density function is:

$$f(Y_1|Y_2) = \frac{1}{\sqrt{2\pi}\,\sigma_{1|2}} \exp\left[-\frac{1}{2}\left(\frac{Y_1 - \alpha_{1|2} - \beta_{12}Y_2}{\sigma_{1|2}}\right)^2\right] \tag{2.79}$$



The parameters $\alpha_{1|2}$, $\beta_{12}$, and $\sigma_{1|2}$ of the conditional probability distributions of $Y_1$ are
functions of the parameters of the joint probability distribution (2.74), as follows:

$$\alpha_{1|2} = \mu_1 - \mu_2\rho_{12}\frac{\sigma_1}{\sigma_2} \tag{2.80a}$$

$$\beta_{12} = \rho_{12}\frac{\sigma_1}{\sigma_2} \tag{2.80b}$$

$$\sigma_{1|2}^2 = \sigma_1^2(1 - \rho_{12}^2) \tag{2.80c}$$

The parameter $\alpha_{1|2}$ is the intercept of the line of regression of $Y_1$ on $Y_2$, and the parameter
$\beta_{12}$ is the slope of this line. Thus we find that the conditional distribution of $Y_1$, given $Y_2$, is
equivalent to the normal error regression model (1.24).
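Formulas (2.80a) through (2.80c) translate directly into code. The following sketch is only an illustration: the function name and the parameter values are hypothetical. It computes the intercept, slope, and conditional standard deviation for the regression of $Y_1$ on $Y_2$ and confirms that the regression line passes through the point of means $(\mu_2, \mu_1)$.

```python
import math


def conditional_parameters(mu1, mu2, sigma1, sigma2, rho12):
    """Intercept, slope, and standard deviation of Y1 given Y2, per (2.80a)-(2.80c)."""
    beta12 = rho12 * sigma1 / sigma2                   # (2.80b)
    alpha_1_given_2 = mu1 - mu2 * beta12               # (2.80a)
    sigma_1_given_2 = sigma1 * math.sqrt(1.0 - rho12 ** 2)  # square root of (2.80c)
    return alpha_1_given_2, beta12, sigma_1_given_2


# Hypothetical joint-distribution parameters.
mu1, mu2, sigma1, sigma2, rho12 = 10.0, 20.0, 2.0, 4.0, 0.5
alpha, beta, sigma_cond = conditional_parameters(mu1, mu2, sigma1, sigma2, rho12)

# Conditional mean of Y1 at Y2 = mu2; it equals mu1 for any parameter values.
mean_at_mu2 = alpha + beta * mu2
print(alpha, beta, sigma_cond, mean_at_mu2)
```

Note that the conditional standard deviation never exceeds $\sigma_1$, and it shrinks toward zero as $|\rho_{12}|$ approaches 1.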

Conditional Probability Distributions of $Y_2$. The random variables $Y_1$ and $Y_2$ play
symmetrical roles in the bivariate normal probability distribution (2.74). Hence, it follows:

The conditional probability distribution of $Y_2$ for any given
value of $Y_1$ is normal with mean $\alpha_{2|1} + \beta_{21}Y_1$ and standard
deviation $\sigma_{2|1}$ and its density function is:

$$f(Y_2|Y_1) = \frac{1}{\sqrt{2\pi}\,\sigma_{2|1}} \exp\left[-\frac{1}{2}\left(\frac{Y_2 - \alpha_{2|1} - \beta_{21}Y_1}{\sigma_{2|1}}\right)^2\right] \tag{2.81}$$

The parameters $\alpha_{2|1}$, $\beta_{21}$, and $\sigma_{2|1}$ of the conditional probability distributions of $Y_2$ are
functions of the parameters of the joint probability distribution (2.74), as follows:

$$\alpha_{2|1} = \mu_2 - \mu_1\rho_{12}\frac{\sigma_2}{\sigma_1} \tag{2.82a}$$

$$\beta_{21} = \rho_{12}\frac{\sigma_2}{\sigma_1} \tag{2.82b}$$

$$\sigma_{2|1}^2 = \sigma_2^2(1 - \rho_{12}^2) \tag{2.82c}$$

Important Characteristics of Conditional Distributions. Three important characteristics
of the conditional probability distributions of $Y_1$ are normality, linear regression, and
constant variance. We take up each of these in turn.

1. The conditional probability distribution of $Y_1$ for any given value of $Y_2$ is normal.
Imagine that we slice a bivariate normal distribution vertically at a given value of $Y_2$, say,
at $Y_{h2}$. That is, we slice it parallel to the $Y_1$ axis. This slicing is shown in Figure 2.11. The
exposed cross section has the shape of a normal distribution, and after being scaled so that
its area is 1, it portrays the conditional probability distribution of $Y_1$, given that $Y_2 = Y_{h2}$.

This property of normality holds no matter what the value $Y_{h2}$ is. Thus, whenever we
slice the bivariate normal distribution parallel to the $Y_1$ axis, we obtain (after proper scaling)
a normal conditional probability distribution.

2. The means of the conditional probability distributions of $Y_1$ fall on a straight line, and
hence are a linear function of $Y_2$:

$$E\{Y_1|Y_2\} = \alpha_{1|2} + \beta_{12}Y_2 \tag{2.83}$$





FIGURE 2.11 Cross Section of Bivariate Normal Distribution at $Y_{h2}$.





Here, $\alpha_{1|2}$ is the intercept parameter and $\beta_{12}$ the slope parameter. Thus, the relation between
the conditional means and $Y_2$ is given by a linear regression function.

3. All conditional probability distributions of $Y_1$ have the same standard deviation $\sigma_{1|2}$.
Thus, no matter where we slice the bivariate normal distribution parallel to the $Y_1$ axis,
the resulting conditional probability distribution (after scaling to have an area of 1) has the
same standard deviation. Hence, constant variances characterize the conditional probability
distributions of $Y_1$.

Equivalence to Normal Error Regression Model. Suppose that we select a random
sample of observations $(Y_1, Y_2)$ from a bivariate normal population and wish to make
conditional inferences about $Y_1$, given $Y_2$. The preceding discussion makes it clear that the
normal error regression model (1.24) is entirely applicable because:

1. The $Y_1$ observations are independent.

2. The $Y_1$ observations when $Y_2$ is considered given or fixed are normally distributed with
mean $E\{Y_1|Y_2\} = \alpha_{1|2} + \beta_{12}Y_2$ and constant variance $\sigma_{1|2}^2$.

Use of Regression Analysis. In view of the equivalence of each of the conditional bivariate
normal correlation models (2.81) and (2.79) with the normal error regression model (1.24),
all conditional inferences with these correlation models can be made by means of the
usual regression methods. For instance, if a researcher has data that can be appropriately
described as having been generated from a bivariate normal distribution and wishes to make
inferences about $Y_2$, given a particular value of $Y_1$, the ordinary regression techniques will
be applicable. Thus, the regression function of $Y_2$ on $Y_1$ can be estimated by means of (1.12),
the slope of the regression line can be estimated by means of the interval estimate (2.15),
a new observation $Y_2$, given the value of $Y_1$, can be predicted by means of (2.36), and so
on. Computer regression packages can be used in the usual manner. To avoid notational
problems, it may be helpful to relabel the variables according to regression usage: $Y = Y_2$,
$X = Y_1$. Of course, if conditional inferences on $Y_1$ for given values of $Y_2$ are desired, the
notation correspondences would be: $Y = Y_1$, $X = Y_2$.





Can we still use regression model (2.1) if $Y_1$ and $Y_2$ are not bivariate normal? It can be
shown that all results on estimation, testing, and prediction obtained from regression model
(2.1) apply if $Y_1 = Y$ and $Y_2 = X$ are random variables, and if the following conditions
hold:

1. The conditional distributions of the $Y_i$, given $X_i$, are normal and independent, with
conditional means $\beta_0 + \beta_1 X_i$ and conditional variance $\sigma^2$.

2. The $X_i$ are independent random variables whose probability distribution does not
involve the parameters $\beta_0$, $\beta_1$, $\sigma^2$.

These conditions require only that regression model (2.1) is appropriate for each conditional
distribution of $Y_i$, and that the probability distribution of the $X_i$ does not involve the
regression parameters. If these conditions are met, all earlier results on estimation, testing,
and prediction still hold even though the $X_i$ are now random variables. The major
modification occurs in the interpretation of confidence coefficients and specified risks of error.
When X is random, these refer to repeated sampling of pairs of $(X_i, Y_i)$ values, where the
$X_i$ values as well as the $Y_i$ values change from sample to sample. Thus, in our bathing suit
sales illustration, a confidence coefficient would refer to the proportion of correct interval
estimates if repeated samples of n days’ sales and temperatures were obtained and the
confidence interval calculated for each sample. Another modification occurs in the test’s
power, which is different when X is a random variable.

Comments 

1. The notation for the parameters of the conditional correlation models departs somewhat from
our previous notation for regression models. The symbol $\alpha$ is now used to denote the regression
intercept. The subscript 1|2 to $\alpha$ indicates that $Y_1$ is regressed on $Y_2$. Similarly, the subscript 2|1 to $\alpha$
indicates that $Y_2$ is regressed on $Y_1$. The symbol $\beta_{12}$ indicates that it is the slope in the regression of $Y_1$
on $Y_2$, while $\beta_{21}$ is the slope in the regression of $Y_2$ on $Y_1$. Finally, $\sigma_{2|1}$ is the standard deviation of the
conditional probability distributions of $Y_2$ for any given $Y_1$, while $\sigma_{1|2}$ is the standard deviation of the
conditional probability distributions of $Y_1$ for any given $Y_2$.

2. Two distinct regressions are involved in a bivariate normal model, that of $Y_1$ on $Y_2$ when $Y_2$ is
fixed and that of $Y_2$ on $Y_1$ when $Y_1$ is fixed. In general, the two regression lines are not the same. For
instance, the two slopes $\beta_{12}$ and $\beta_{21}$ are the same only if $\sigma_1 = \sigma_2$, as can be seen from (2.80b) and
(2.82b).

3. When interval estimates for the conditional correlation models are obtained, the confidence
coefficient refers to repeated samples where pairs of observations $(Y_1, Y_2)$ are obtained from the
bivariate normal distribution. ■
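The relation between the two slopes noted in Comment 2 can be made explicit with a short worked step from (2.80b) and (2.82b). The derivation below is a supplementary sketch rather than part of the original comments:

```latex
\begin{aligned}
\beta_{12}\,\beta_{21}
  &= \left(\rho_{12}\frac{\sigma_1}{\sigma_2}\right)
     \left(\rho_{12}\frac{\sigma_2}{\sigma_1}\right)
   = \rho_{12}^2 \\[4pt]
\beta_{12} - \beta_{21}
  &= \rho_{12}\left(\frac{\sigma_1}{\sigma_2} - \frac{\sigma_2}{\sigma_1}\right)
   = \rho_{12}\,\frac{\sigma_1^2 - \sigma_2^2}{\sigma_1\sigma_2}
\end{aligned}
```

The product of the two slopes is therefore always $\rho_{12}^2$, and the difference vanishes when $\sigma_1 = \sigma_2$ (or, trivially, when $\rho_{12} = 0$, in which case both slopes are zero).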


Inferences on Correlation Coefficients 

A principal use of the bivariate normal correlation model is to study the relationship between
two variables. In a bivariate normal model, the parameter $\rho_{12}$ provides information about
the degree of the linear relationship between the two variables $Y_1$ and $Y_2$.

Point Estimator of $\rho_{12}$. The maximum likelihood estimator of $\rho_{12}$, denoted by $r_{12}$, is
given by:

$$r_{12} = \frac{\sum (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2)}{\left[\sum (Y_{i1} - \bar{Y}_1)^2 \sum (Y_{i2} - \bar{Y}_2)^2\right]^{1/2}} \tag{2.84}$$





This estimator is often called the Pearson product-moment correlation coefficient. It is a
biased estimator of $\rho_{12}$ (unless $\rho_{12} = 0$ or 1), but the bias is small when n is large.

It can be shown that the range of $r_{12}$ is:

$$-1 \leq r_{12} \leq 1 \tag{2.85}$$

Generally, values of $r_{12}$ near 1 indicate a strong positive (direct) linear association between
$Y_1$ and $Y_2$ whereas values of $r_{12}$ near $-1$ indicate a strong negative (indirect) linear
association. Values of $r_{12}$ near 0 indicate little or no linear association between $Y_1$ and $Y_2$.


Test whether $\rho_{12} = 0$. When the population is bivariate normal, it is frequently desired
to test whether the coefficient of correlation is zero:

$$H_0: \rho_{12} = 0 \qquad H_a: \rho_{12} \neq 0 \tag{2.86}$$

The reason for interest in this test is that in the case where $Y_1$ and $Y_2$ are jointly normally
distributed, $\rho_{12} = 0$ implies that $Y_1$ and $Y_2$ are independent.

We can use regression procedures for the test since (2.80b) implies that the following
alternatives are equivalent to those in (2.86):

$$H_0: \beta_{12} = 0 \qquad H_a: \beta_{12} \neq 0 \tag{2.86a}$$

and (2.82b) implies that the following alternatives are also equivalent to the ones in (2.86):

$$H_0: \beta_{21} = 0 \qquad H_a: \beta_{21} \neq 0 \tag{2.86b}$$

It can be shown that the test statistics for testing either (2.86a) or (2.86b) are the same
and can be expressed directly in terms of $r_{12}$:

$$t^* = \frac{r_{12}\sqrt{n - 2}}{\sqrt{1 - r_{12}^2}} \tag{2.87}$$

If $H_0$ holds, t* follows the $t(n - 2)$ distribution. The appropriate decision rule to control
the Type I error at $\alpha$ is:

$$\begin{aligned}
&\text{If } |t^*| \leq t(1 - \alpha/2;\ n - 2), \text{ conclude } H_0 \\
&\text{If } |t^*| > t(1 - \alpha/2;\ n - 2), \text{ conclude } H_a
\end{aligned} \tag{2.88}$$

Test statistic (2.87) is identical to the regression t* test statistic (2.17).


A national oil company was interested in the relationship between its service station gasoline
sales and its sales of auxiliary products. A company analyst obtained a random sample of
23 of its service stations and obtained average monthly sales data on gasoline sales ($Y_1$)
and comparable sales of its auxiliary products and services ($Y_2$). These data (not shown)
resulted in an estimated correlation coefficient $r_{12} = .52$. Suppose the analyst wished to test
whether or not the association was positive, controlling the level of significance at $\alpha = .05$.
The alternatives would then be:

$$H_0: \rho_{12} \le 0 \qquad H_a: \rho_{12} > 0$$

and the decision rule based on test statistic (2.87) would be:

$$\begin{aligned}
&\text{If } t^* \le t(1-\alpha;\, n-2), \text{ conclude } H_0 \\
&\text{If } t^* > t(1-\alpha;\, n-2), \text{ conclude } H_a
\end{aligned}$$

For $\alpha = .05$, we require $t(.95; 21) = 1.721$. Since:

$$t^* = \frac{.52\sqrt{21}}{\sqrt{1-(.52)^2}} = 2.79$$

is greater than 1.721, we would conclude $H_a$, that $\rho_{12} > 0$. The P-value for this test is .006.
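The arithmetic of this one-sided test can be reproduced in a few lines. This is a sketch only: the function name `correlation_t_star` is an assumption introduced here, and the critical value 1.721 is taken from the text rather than computed.

```python
import math

def correlation_t_star(r12, n):
    """Test statistic (2.87) for H0: rho_12 = 0 (or rho_12 <= 0)."""
    return r12 * math.sqrt(n - 2) / math.sqrt(1 - r12 ** 2)

r12 = 0.52      # estimated correlation from the oil company example
n = 23          # number of service stations
t_crit = 1.721  # t(.95; 21), as quoted in the text

t_star = correlation_t_star(r12, n)
conclusion = "Ha" if t_star > t_crit else "H0"
print(round(t_star, 2), conclusion)
```

Passing the tabled percentile in as a constant keeps the sketch free of third-party dependencies; a statistics package would normally supply both the percentile and the P-value.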

Interval Estimation of $\rho_{12}$ Using the $z'$ Transformation. Because the sampling distribution
of $r_{12}$ is complicated when $\rho_{12} \neq 0$, interval estimation of $\rho_{12}$ is usually carried
out by means of an approximate procedure based on a transformation. This transformation,
known as the Fisher z transformation, is as follows:

$$z' = \frac{1}{2}\log_e\left(\frac{1+r_{12}}{1-r_{12}}\right) \tag{2.89}$$

When n is large (25 or more is a useful rule of thumb), the distribution of $z'$ is approximately
normal with approximate mean and variance:

$$E\{z'\} = \zeta = \frac{1}{2}\log_e\left(\frac{1+\rho_{12}}{1-\rho_{12}}\right) \tag{2.90}$$

$$\sigma^2\{z'\} = \frac{1}{n-3} \tag{2.91}$$

Note that the transformation from $r_{12}$ to $z'$ in (2.89) is the same as the relation in (2.90)
between $\rho_{12}$ and $E\{z'\} = \zeta$. Also note that the approximate variance of $z'$ is a known
constant, depending only on the sample size n.

Table B.8 gives paired values for the left and right sides of (2.89) and (2.90), thus
eliminating the need for calculations. For instance, if $r_{12}$ or $\rho_{12}$ equals .25, Table B.8 indicates
that $z'$ or $\zeta$ equals .2554, and vice versa. The values on the two sides of the transformation
always have the same sign. Thus, if $r_{12}$ or $\rho_{12}$ is negative, a minus sign is attached to the
value in Table B.8. For instance, if $r_{12} = -.25$, $z' = -.2554$.
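Where Table B.8 is not at hand, the transformation in (2.89) can be evaluated directly. The sketch below uses only the Python standard library; the function names are assumptions introduced for illustration. The inverse of the transformation is the hyperbolic tangent, which is what a table lookup "in reverse" amounts to.

```python
import math

def fisher_z(r):
    """Fisher z transformation (2.89): z' = (1/2) * log_e((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transform a value of z' (or zeta) to the correlation scale."""
    return math.tanh(z)

z_pos = fisher_z(0.25)            # the Table B.8 example value
z_neg = fisher_z(-0.25)           # same magnitude, minus sign attached
r_back = inverse_fisher_z(z_pos)  # recovers .25
print(round(z_pos, 4), round(z_neg, 4), round(r_back, 4))
```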

Interval Estimate. When the sample size is large ($n \ge 25$), the standardized statistic:

$$\frac{z' - \zeta}{\sigma\{z'\}} \tag{2.92}$$

is approximately a standard normal variable. Therefore, approximate $1-\alpha$ confidence limits
for $\zeta$ are:

$$z' \pm z(1-\alpha/2)\,\sigma\{z'\} \tag{2.93}$$

where $z(1-\alpha/2)$ is the $(1-\alpha/2)100$ percentile of the standard normal distribution. The
$1-\alpha$ confidence limits for $\rho_{12}$ are then obtained by transforming the limits on $\zeta$ by means
of (2.90).





Example 


An economist investigated food purchasing patterns by households in a midwestern city.
Two hundred households with family incomes between $40,000 and $60,000 were selected
to ascertain, among other things, the proportions of the food budget expended for beef and
poultry, respectively. The economist expected these to be negatively related, and wished to
estimate the coefficient of correlation with a 95 percent confidence interval. Some supporting
evidence suggested that the joint distribution of the two variables does not depart markedly
from a bivariate normal one.

The point estimate of $\rho_{12}$ was $r_{12} = -.61$ (data and calculations not shown). To obtain
an approximate 95 percent confidence interval estimate, we require:

$$z' = -.7089 \text{ when } r_{12} = -.61 \quad \text{(from Table B.8)}$$

$$\sigma\{z'\} = \frac{1}{\sqrt{200-3}} = .07125$$

$$z(.975) = 1.960$$

Hence, the confidence limits for $\zeta$, by (2.93), are $-.7089 \pm 1.960(.07125)$, and the
approximate 95 percent confidence interval is:

$$-.849 \le \zeta \le -.569$$

Using Table B.8 to transform back to $\rho_{12}$, we obtain:

$$-.69 \le \rho_{12} \le -.51$$

This confidence interval was sufficiently precise to be useful to the economist, confirming
the negative relation and indicating that the degree of linear association is moderately high.
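The steps of this example can be sketched in code as a check on the table lookups. The function name and its `confidence` parameter are assumptions made for illustration; the standard normal percentile comes from Python's `statistics.NormalDist` rather than from a table, and the back-transformation uses the hyperbolic tangent in place of Table B.8.

```python
import math
from statistics import NormalDist

def correlation_confidence_interval(r12, n, confidence=0.95):
    """Approximate confidence limits for zeta and rho_12 via (2.89)-(2.93)."""
    z_prime = 0.5 * math.log((1 + r12) / (1 - r12))  # (2.89)
    sigma_z = 1 / math.sqrt(n - 3)                   # square root of (2.91)
    alpha = 1 - confidence
    z_mult = NormalDist().inv_cdf(1 - alpha / 2)     # z(1 - alpha/2)
    zeta_lo = z_prime - z_mult * sigma_z             # (2.93)
    zeta_hi = z_prime + z_mult * sigma_z
    # Transform the limits on zeta back to the correlation scale.
    return (zeta_lo, zeta_hi), (math.tanh(zeta_lo), math.tanh(zeta_hi))

(zeta_lo, zeta_hi), (rho_lo, rho_hi) = correlation_confidence_interval(-0.61, 200)
print(round(zeta_lo, 3), round(zeta_hi, 3))
print(round(rho_lo, 2), round(rho_hi, 2))
```

Rounded as in the text, the limits agree with the interval for $\zeta$ and the interval for $\rho_{12}$ obtained above.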


Comments 

1. As usual, a confidence interval for $\rho_{12}$ can be employed to test whether or not $\rho_{12}$ has a specified
value (say, .5) by noting whether or not the specified value falls within the confidence limits.

2. It can be shown that the square of the coefficient of correlation, namely $\rho_{12}^2$, measures the
relative reduction in the variability of $Y_2$ associated with the use of variable $Y_1$. To see this, we noted
earlier in (2.80c) and (2.82c) that:

$$\sigma_{1|2}^2 = \sigma_1^2(1-\rho_{12}^2) \tag{2.94a}$$

$$\sigma_{2|1}^2 = \sigma_2^2(1-\rho_{12}^2) \tag{2.94b}$$

We can rewrite these expressions as follows:

$$\rho_{12}^2 = \frac{\sigma_1^2 - \sigma_{1|2}^2}{\sigma_1^2} \tag{2.95a}$$

$$\rho_{12}^2 = \frac{\sigma_2^2 - \sigma_{2|1}^2}{\sigma_2^2} \tag{2.95b}$$

The meaning of $\rho_{12}^2$ is now clear. Consider first (2.95a). $\rho_{12}^2$ measures how much smaller relatively is
the variability in the conditional distributions of $Y_1$, for any given level of $Y_2$, than is the variability
in the marginal distribution of $Y_1$. Thus, $\rho_{12}^2$ measures the relative reduction in the variability of $Y_1$
associated with the use of variable $Y_2$. Correspondingly, (2.95b) shows that $\rho_{12}^2$ also measures the
relative reduction in the variability of $Y_2$ associated with the use of variable $Y_1$.

It can be shown that:

$$0 \le \rho_{12}^2 \le 1 \tag{2.96}$$

The limiting value $\rho_{12}^2 = 0$ occurs when $Y_1$ and $Y_2$ are independent, so that the variances of each
variable in the conditional probability distributions are then no smaller than the variance in the
marginal distribution. The limiting value $\rho_{12}^2 = 1$ occurs when there is no variability in the conditional
probability distributions for each variable, so perfect predictions of either variable can be made from
the other.

3. The interpretation of $\rho_{12}^2$ as measuring the relative reduction in the conditional variances as
compared with the marginal variance is valid for the case of a bivariate normal population, but not
for many other bivariate populations. Of course, the interpretation implies nothing in a causal sense.

4. Confidence limits for $\rho_{12}^2$ can be obtained by squaring the respective confidence limits for $\rho_{12}$,
provided the latter limits do not differ in sign. ■
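Comment 4 can be illustrated with the limits from the food purchasing example. The helper below is a hypothetical sketch, not a routine from any package: it squares the two limits, reorders them, and refuses to proceed when the limits differ in sign.

```python
def squared_correlation_limits(lower, upper):
    """Confidence limits for rho_12 squared, from limits for rho_12.

    Valid only when the two limits do not differ in sign (Comment 4).
    """
    if lower * upper < 0:
        raise ValueError("limits differ in sign; squaring them is not valid")
    squares = sorted((lower ** 2, upper ** 2))
    return squares[0], squares[1]

# Limits for rho_12 from the food purchasing example.
lo_sq, hi_sq = squared_correlation_limits(-0.69, -0.51)
print(round(lo_sq, 4), round(hi_sq, 4))
```

The sort is needed because squaring negative limits reverses their order: the squared lower limit $(-.69)^2$ becomes the upper limit for $\rho_{12}^2$.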


Spearman Rank Correlation Coefficient 

At times the joint distribution of two random variables $Y_1$ and $Y_2$ differs considerably from
the bivariate normal distribution (2.74). In those cases, transformations of the variables $Y_1$
and $Y_2$ may be sought to make the joint distribution of the transformed variables approximately
bivariate normal and thus permit the use of the inference procedures about $\rho_{12}$
described earlier.

When no appropriate transformations can be found, a nonparametric rank correlation
procedure may be useful for making inferences about the association between $Y_1$ and $Y_2$. The
Spearman rank correlation coefficient is widely used for this purpose. First, the observations
on $Y_1$ (i.e., $Y_{11}, \ldots, Y_{n1}$) are expressed in ranks from 1 to n. We denote the rank of $Y_{i1}$ by
$R_{i1}$. Similarly, the observations on $Y_2$ (i.e., $Y_{12}, \ldots, Y_{n2}$) are ranked, with the rank of $Y_{i2}$
denoted by $R_{i2}$. The Spearman rank correlation coefficient, to be denoted by $r_S$, is then
defined as the ordinary Pearson product-moment correlation coefficient in (2.84) based on
the rank data:

$$r_S = \frac{\sum (R_{i1} - \bar{R}_1)(R_{i2} - \bar{R}_2)}{\left[\sum (R_{i1} - \bar{R}_1)^2 \sum (R_{i2} - \bar{R}_2)^2\right]^{1/2}} \tag{2.97}$$

Here $\bar{R}_1$ is the mean of the ranks $R_{i1}$ and $\bar{R}_2$ is the mean of the ranks $R_{i2}$. Of course, since
the ranks $R_{i1}$ and $R_{i2}$ are the integers $1, \ldots, n$, it follows that $\bar{R}_1 = \bar{R}_2 = (n+1)/2$.

Like an ordinary correlation coefficient, the Spearman rank correlation coefficient takes
on values between $-1$ and 1 inclusive:

$$-1 \le r_S \le 1 \tag{2.98}$$

The coefficient $r_S$ equals 1 when the ranks for $Y_1$ are identical to those for $Y_2$, that is, when
the case with rank 1 for $Y_1$ also has rank 1 for $Y_2$, and so on. In that case, there is perfect
association between the ranks for the two variables. The coefficient $r_S$ equals $-1$ when the
case with rank 1 for $Y_1$ has rank n for $Y_2$, the case with rank 2 for $Y_1$ has rank $n-1$ for
$Y_2$, and so on. In that event, there is perfect inverse association between the ranks for the
two variables. When there is little, if any, association between the ranks of $Y_1$ and $Y_2$, the
Spearman rank correlation coefficient tends to have a value near zero.





The Spearman rank correlation coefficient can be used to test the alternatives:

$$\begin{aligned}
&H_0: \text{There is no association between } Y_1 \text{ and } Y_2 \\
&H_a: \text{There is an association between } Y_1 \text{ and } Y_2
\end{aligned} \tag{2.99}$$

A two-sided test is conducted here since $H_a$ includes either positive or negative association.
When the alternative $H_a$ is:

$$H_a: \text{There is positive (negative) association between } Y_1 \text{ and } Y_2 \tag{2.100}$$

an upper-tail (lower-tail) one-sided test is conducted.

The probability distribution of $r_S$ under $H_0$ is not difficult to obtain. It is based on the
condition that, for any ranking of $Y_1$, all rankings of $Y_2$ are equally likely when there is no
association between $Y_1$ and $Y_2$. Tables have been prepared and are presented in specialized
texts such as Reference 2.1. Computer packages generally do not present the probability
distribution of $r_S$ under $H_0$ but give only the two-sided P-value. When the sample size n
exceeds 10, the test can be carried out approximately by using test statistic (2.87):

$$t^* = \frac{r_S\sqrt{n-2}}{\sqrt{1-r_S^2}} \tag{2.101}$$

based on the t distribution with $n-2$ degrees of freedom.


Example


A market researcher wished to examine whether an association exists between population
size ($Y_1$) and per capita expenditures for a new food product ($Y_2$). The data for a random
sample of 12 test markets are given in Table 2.4, columns 1 and 2. Because the distributions of
the variables do not appear to be approximately normal, a nonparametric test of association
is desired. The ranks for the variables are given in Table 2.4, columns 3 and 4. A computer
package found that the coefficient of simple correlation between the ranked data in columns
3 and 4 is $r_S = .895$. The alternatives of interest are the two-sided ones in (2.99).

TABLE 2.4 Data on Population and Expenditures and Their Ranks: Sales Marketing Example.

              (1)               (2)
 Test      Population        Per Capita
Market   (in thousands)     Expenditure       (3)     (4)
   i          Y_i1        (dollars) Y_i2      R_i1    R_i2

   1            29             127              1       2
   2           435             214              8      11
   3            86             133              3       4
   4         1,090             208             11      10
   5           219             153              7       6
   6           503             184              9       8
   7            47             130              2       3
   8         3,524             217             12      12
   9           185             141              6       5
  10            98             154              5       7
  11           952             194             10       9
  12            89             103              4       1

Since n exceeds 10 here, we use test statistic (2.101):

$$t^* = \frac{.895\sqrt{12-2}}{\sqrt{1-(.895)^2}} = 6.34$$

For $\alpha = .01$, we require $t(.995; 10) = 3.169$. Since $|t^*| = 6.34 > 3.169$, we conclude $H_a$,
that there is an association between population size and per capita expenditures for the food
product. The two-sided P-value of the test is .00008.
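The computation in this example can be sketched directly from definition (2.97): rank each variable, then apply the ordinary product-moment formula to the ranks. The code below is an illustrative sketch with hypothetical function names. The data are the Table 2.4 values as read here; the expenditure of 154 for market 10 is an assumption, chosen because it is consistent with the reported ranks and correlations. Tied values, which do not occur in these data, would receive the average of the ranks involved.

```python
import math

def average_ranks(values):
    """Ranks from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 through end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(u, v):
    """Ordinary product-moment correlation coefficient."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Table 2.4 data; the value 154 (market 10) is an assumed reading.
population = [29, 435, 86, 1090, 219, 503, 47, 3524, 185, 98, 952, 89]
expenditure = [127, 214, 133, 208, 153, 184, 130, 217, 141, 154, 194, 103]

r1 = average_ranks(population)
r2 = average_ranks(expenditure)
r_s = pearson(r1, r2)  # Spearman coefficient (2.97)
n = len(population)
t_star = r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s ** 2)  # (2.101)
print(round(r_s, 3), round(t_star, 2))
```

Because the sketch carries $r_S$ at full precision rather than rounding it to .895 first, its $t^*$ is about 6.35 instead of 6.34; the conclusion is unaffected.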


Comments 

1. In case of ties among some data values, each of the tied values is given the average of the ranks
involved.

2. It is interesting to note that had the data in Table 2.4 been analyzed under the bivariate
normal distribution assumption (2.74) with test statistic (2.87), then the strength of the association
would have been somewhat weaker. In particular, the Pearson product-moment correlation coefficient
is $r_{12} = .674$, with $t^* = .674\sqrt{10}/\sqrt{1-(.674)^2} = 2.885$. Our conclusion would have been to
conclude $H_0$, that there is no association between population size and per capita expenditures for the
food product. The two-sided P-value of the test is .016.

3. Another nonparametric rank procedure similar to Spearman's $r_S$ is Kendall's $\tau$. This statistic
also measures how far the rankings of $Y_1$ and $Y_2$ differ from each other, but in a somewhat different
way than the Spearman rank correlation coefficient. A discussion of Kendall's $\tau$ may be found in
Reference 2.2. ■
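The comparison in Comment 2 can be sketched by applying the product-moment formula to the unranked data. As before, this is an illustration under stated assumptions: the function name is hypothetical, the expenditure of 154 for market 10 is an assumed reading of Table 2.4, and the critical value 3.169 is the tabled $t(.995; 10)$ quoted in the example.

```python
import math

def pearson(u, v):
    """Ordinary product-moment correlation coefficient."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Table 2.4 data; the value 154 (market 10) is an assumed reading.
population = [29, 435, 86, 1090, 219, 503, 47, 3524, 185, 98, 952, 89]
expenditure = [127, 214, 133, 208, 153, 184, 130, 217, 141, 154, 194, 103]

r12 = pearson(population, expenditure)
n = len(population)
t_star = r12 * math.sqrt(n - 2) / math.sqrt(1 - r12 ** 2)  # (2.87)
t_crit = 3.169  # t(.995; 10), as quoted in the example
conclusion = "Ha" if abs(t_star) > t_crit else "H0"
print(round(r12, 3), conclusion)
```

The single very large market (3,524 thousand) pulls the product-moment coefficient down, whereas its rank of 12 carries no such leverage in $r_S$.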


Cited References

2.1. Gibbons, J. D. Nonparametric Methods for Quantitative Analysis. 2nd ed. Columbus, Ohio:
American Sciences Press, 1985.

2.2. Kendall, M. G., and J. D. Gibbons. Rank Correlation Methods. 5th ed. London: Oxford University
Press, 1990.


Problems

2.1. A student working on a summer internship in the economic research department of a large
corporation studied the relation between sales of a product (Y, in million dollars) and population
(X, in million persons) in the firm's 50 marketing districts. The normal error regression model
(2.1) was employed. The student first wished to test whether or not a linear association between
Y and X existed. The student accessed a simple linear regression program and obtained the
following information on the regression coefficients:

                                      95 Percent
Parameter     Estimated Value      Confidence Limits

Intercept         7.43119         -1.18518    16.0476
Slope              .755048          .452886    1.05721

a. The student concluded from these results that there is a linear association between Y and
X. Is the conclusion warranted? What is the implied level of significance?

b. Someone questioned the negative lower confidence limit for the intercept, pointing out that
dollar sales cannot be negative even if the population in a district is zero. Discuss.

2.2. In a test of the alternatives $H_0: \beta_1 \le 0$ versus $H_a: \beta_1 > 0$, an analyst concluded $H_0$. Does this
conclusion imply that there is no linear association between X and Y? Explain.







2.3. A member of a student team playing an interactive marketing game received the following
computer output when studying the relation between advertising expenditures (X) and sales
(Y) for one of the team’s products:

Estimated regression equation: $\hat{Y} = 350.7 - .18X$
Two-sided P-value for estimated slope: .91

The student stated: “The message I get here is that the more we spend on advertising this
product, the fewer units we sell!” Comment.

2.4. Refer to Grade point average Problem 1.19. 

a. Obtain a 99 percent confidence interval for $\beta_1$. Interpret your confidence interval. Does it
include zero? Why might the director of admissions be interested in whether the confidence
interval includes zero?

b. Test, using the test statistic t*, whether or not a linear association exists between student’s 
ACT score (X) and GPA at the end of the freshman year (Y). Use a level of significance of 
.01. State the alternatives, decision rule, and conclusion. 

c. What is the P-value of your test in part (b)? How does it support the conclusion reached in 
part (b)? 

*2.5. Refer to Copier maintenance Problem 1.20. 

a. Estimate the change in the mean service time when the number of copiers serviced increases 
by one. Use a 90 percent confidence interval. Interpret your confidence interval. 

b. Conduct a t test to determine whether or not there is a linear association between X and Y
here; control the $\alpha$ risk at .10. State the alternatives, decision rule, and conclusion. What is
the P-value of your test?

c. Are your results in parts (a) and (b) consistent? Explain.

d. The manufacturer has suggested that the mean required time should not increase by more
than 14 minutes for each additional copier that is serviced on a service call. Conduct a test to
decide whether this standard is being satisfied by Tri-City. Control the risk of a Type I error
at .05. State the alternatives, decision rule, and conclusion. What is the P-value of the test?

e. Does $b_0$ give any relevant information here about the “start-up” time on calls, i.e., about
the time required before service work is begun on the copiers at a customer location?

*2.6. Refer to Airfreight breakage Problem 1.21. 

a. Estimate $\beta_1$ with a 95 percent confidence interval. Interpret your interval estimate.

b. Conduct a t test to decide whether or not there is a linear association between number of times
a carton is transferred (X) and number of broken ampules (Y). Use a level of significance
of .05. State the alternatives, decision rule, and conclusion. What is the P-value of the test?

c. $\beta_0$ represents here the mean number of ampules broken when no transfers of the shipment
are made, i.e., when X = 0. Obtain a 95 percent confidence interval for $\beta_0$ and interpret it.

d. A consultant has suggested, on the basis of previous experience, that the mean number of
broken ampules should not exceed 9.0 when no transfers are made. Conduct an appropriate
test, using $\alpha = .025$. State the alternatives, decision rule, and conclusion. What is the
P-value of the test?

e. Obtain the power of your test in part (b) if actually $\beta_1 = 2.0$. Assume $\sigma\{b_1\} = .50$. Also
obtain the power of your test in part (d) if actually $\beta_0 = 11$. Assume $\sigma\{b_0\} = .75$.

2.7. Refer to Plastic hardness Problem 1.22.

a. Estimate the change in the mean hardness when the elapsed time increases by one hour. Use
a 99 percent confidence interval. Interpret your interval estimate.

b. The plastic manufacturer has stated that the mean hardness should increase by 2 Brinell
units per hour. Conduct a two-sided test to decide whether this standard is being satisfied;
use $\alpha = .01$. State the alternatives, decision rule, and conclusion. What is the P-value of
the test?

c. Obtain the power of your test in part (b) if the standard actually is being exceeded by
3 Brinell units per hour. Assume $\sigma\{b_1\} = .1$.

2.8. Refer to Figure 2.2 for the Toluca Company example. A consultant has advised that an increase
of one unit in lot size should require an increase of 3.0 in the expected number of work hours
for the given production item.

a. Conduct a test to decide whether or not the increase in the expected number of work hours
in the Toluca Company equals this standard. Use $\alpha = .05$. State the alternatives, decision
rule, and conclusion.

b. Obtain the power of your test in part (a) if the consultant's standard actually is being exceeded
by .5 hour. Assume $\sigma\{b_1\} = .35$.

c. Why is $F^* = 105.88$, given in the printout, not relevant for the test in part (a)?

2.9. Refer to Figure 2.2. A student, noting that $s\{b_1\}$ is furnished in the printout, asks why $s\{\hat{Y}_h\}$ is
not also given. Discuss.

2.10. For each of the following questions, explain whether a confidence interval for a mean response 
or a prediction interval for a new observation is appropriate. 

a. What will be the humidity level in this greenhouse tomorrow when we set the temperature 
level at 31°C? 

b. How much do families whose disposable income is $23,500 spend, on the average, for meals 
away from home? 

c. How many kilowatt-hours of electricity will be consumed next month by commercial and
industrial users in the Twin Cities service area, given that the index of business activity for
the area remains at its present level?

2.11. A person asks if there is a difference between the “mean response at $X = X_h$” and the “mean
of m new observations at $X = X_h$.” Reply.

2.12. Can $\sigma^2\{\text{pred}\}$ in (2.37) be brought increasingly close to 0 as n becomes large? Is this also the
case for $\sigma^2\{\hat{Y}_h\}$ in (2.29b)? What is the implication of this difference?

2.13. Refer to Grade point average Problem 1.19. 

a. Obtain a 95 percent interval estimate of the mean freshman GPA for students whose ACT
test score is 28. Interpret your confidence interval.

b. Mary Jones obtained a score of 28 on the entrance test. Predict her freshman GPA using a
95 percent prediction interval. Interpret your prediction interval.

c. Is the prediction interval in part (b) wider than the confidence interval in part (a)? Should it
be?

d. Determine the boundary values of the 95 percent confidence band for the regression line
when $X_h = 28$. Is your confidence band wider at this point than the confidence interval in
part (a)? Should it be?

*2.14. Refer to Copier maintenance Problem 1.20. 

a. Obtain a 90 percent confidence interval for the mean service time on calls in which six
copiers are serviced. Interpret your confidence interval.

b. Obtain a 90 percent prediction interval for the service time on the next call in which six
copiers are serviced. Is your prediction interval wider than the corresponding confidence
interval in part (a)? Should it be?

c. Management wishes to estimate the expected service time per copier on calls in which six
copiers are serviced. Obtain an appropriate 90 percent confidence interval by converting the
interval obtained in part (a). Interpret the converted confidence interval.

d. Determine the boundary values of the 90 percent confidence band for the regression line
when $X_h = 6$. Is your confidence band wider at this point than the confidence interval in
part (a)? Should it be?

*2.15. Refer to Airfreight breakage Problem 1.21. 

a. Because of changes in airline routes, shipments may have to be transferred more frequently 
than in the past. Estimate the mean breakage for the following numbers of transfers: X = 2, 
4. Use separate 99 percent confidence intervals. Interpret your results. 

b. The next shipment will entail two transfers. Obtain a 99 percent prediction interval for the
number of broken ampules for this shipment. Interpret your prediction interval.

c. In the next several days, three independent shipments will be made, each entailing two
transfers. Obtain a 99 percent prediction interval for the mean number of ampules broken in
the three shipments. Convert this interval into a 99 percent prediction interval for the total
number of ampules broken in the three shipments.

d. Determine the boundary values of the 99 percent confidence band for the regression line
when $X_h = 2$ and when $X_h = 4$. Is your confidence band wider at these two points than the
corresponding confidence intervals in part (a)? Should it be?

2.16. Refer to Plastic hardness Problem 1.22. 

a. Obtain a 98 percent confidence interval for the mean hardness of molded items with an
elapsed time of 30 hours. Interpret your confidence interval.

b. Obtain a 98 percent prediction interval for the hardness of a newly molded test item with
an elapsed time of 30 hours.

c. Obtain a 98 percent prediction interval for the mean hardness of 10 newly molded test items,
each with an elapsed time of 30 hours.

d. Is the prediction interval in part (c) narrower than the one in part (b)? Should it be?

e. Determine the boundary values of the 98 percent confidence band for the regression line
when $X_h = 30$. Is your confidence band wider at this point than the confidence interval in
part (a)? Should it be?

2.17. An analyst fitted normal error regression model (2.1) and conducted an F test of $\beta_1 = 0$ versus
$\beta_1 \neq 0$. The P-value of the test was .033, and the analyst concluded $H_a: \beta_1 \neq 0$. Was the $\alpha$
level used by the analyst greater than or smaller than .033? If the $\alpha$ level had been .01, what
would have been the appropriate conclusion?

2.18. For conducting statistical tests concerning the parameter $\beta_1$, why is the t test more versatile
than the F test?

2.19. When testing whether or not $\beta_1 = 0$, why is the F test a one-sided test even though $H_a$ includes
both $\beta_1 < 0$ and $\beta_1 > 0$? [Hint: Refer to (2.57).]

2.20. A student asks whether $R^2$ is a point estimator of any parameter in the normal error regression
model (2.1). Respond.

2.21. A value of $R^2$ near 1 is sometimes interpreted to imply that the relation between Y and X is
sufficiently close so that suitably precise predictions of Y can be made from knowledge of X.
Is this implication a necessary consequence of the definition of $R^2$?

2.22. Using the normal error regression model (2.1) in an engineering safety experiment, a researcher
found for the first 10 cases that $R^2$ was zero. Is it possible that for the complete set of 30 cases
$R^2$ will not be zero? Could $R^2$ not be zero for the first 10 cases, yet equal zero for all 30 cases?
Explain.
Explain. 





2.23. Refer to Grade point average Problem 1.19. 

a. Set up the ANOVA table. 

b. What is estimated by MSR in your ANOVA table? By MSE? Under what condition do MSR
and MSE estimate the same quantity?

c. Conduct an F test of whether or not $\beta_1 = 0$. Control the $\alpha$ risk at .01. State the alternatives,
decision rule, and conclusion.

d. What is the absolute magnitude of the reduction in the variation of Y when X is introduced
into the regression model? What is the relative reduction? What is the name of the latter
measure?

e. Obtain r and attach the appropriate sign.

f. Which measure, $R^2$ or r, has the more clear-cut operational interpretation? Explain.

*2.24. Refer to Copier maintenance Problem 1.20. 

a. Set up the basic ANOVA table in the format of Table 2.2. Which elements of your table are
additive? Also set up the ANOVA table in the format of Table 2.3. How do the two tables differ?

b. Conduct an F test to determine whether or not there is a linear association between time
spent and number of copiers serviced; use $\alpha = .10$. State the alternatives, decision rule, and
conclusion.

c. By how much, relatively, is the total variation in number of minutes spent on a call reduced
when the number of copiers serviced is introduced into the analysis? Is this a relatively small
or large reduction? What is the name of this measure?

d. Calculate r and attach the appropriate sign.

e. Which measure, r or $R^2$, has the more clear-cut operational interpretation?

*2.25. Refer to Airfreight breakage Problem 1.21. 

a. Set up the ANOVA table. Which elements are additive? 

b. Conduct an F test to decide whether or not there is a linear association between the number
of times a carton is transferred and the number of broken ampules; control the $\alpha$ risk at .05.
State the alternatives, decision rule, and conclusion.

c. Obtain the $t^*$ statistic for the test in part (b) and demonstrate numerically its equivalence to
the $F^*$ statistic obtained in part (b).

d. Calculate $R^2$ and r. What proportion of the variation in Y is accounted for by introducing
X into the regression model?

2.26. Refer to Plastic hardness Problem 1.22.

a. Set up the ANOVA table.

b. Test by means of an F test whether or not there is a linear association between the hardness
of the plastic and the elapsed time. Use $\alpha = .01$. State the alternatives, decision rule, and
conclusion.

c. Plot the deviations $Y_i - \hat{Y}_i$ against $X_i$ on a graph. Plot the deviations $\hat{Y}_i - \bar{Y}$ against $X_i$
on another graph, using the same scales as for the first graph. From your two graphs, does
SSE or SSR appear to be the larger component of SSTO? What does this imply about the
magnitude of $R^2$?

d. Calculate $R^2$ and r.

*2.27. Refer to Muscle mass Problem 1.27. 

a. Conduct a test to decide whether or not there is a negative linear association between amount 
of muscle mass and age. Control the risk of Type I error at .05. State the alternatives, decision 
rule, and conclusion. What is the P-value of the test? 



b. The two-sided P-value for the test whether $\beta_0 = 0$ is 0+. Can it now be concluded
that $b_0$ provides relevant information on the amount of muscle mass at birth for a female
child?

c. Estimate with a 95 percent confidence interval the difference in expected muscle mass for
women whose ages differ by one year. Why is it not necessary to know the specific ages to
make this estimate?

*2.28. Refer to Muscle mass Problem 1.27. 

a. Obtain a 95 percent confidence interval for the mean muscle mass for women of age 60. 
Interpret your confidence interval. 

b. Obtain a 95 percent prediction interval for the muscle mass of a woman whose age is 60. Is 
the prediction interval relatively precise? 

c. Determine the boundary values of the 95 percent confidence band for the regression line
when $X_h = 60$. Is your confidence band wider at this point than the confidence interval in
part (a)? Should it be?

*2.29. Refer to Muscle mass Problem 1.27. 

a. Plot the deviations $Y_i - \hat{Y}_i$ against $X_i$ on one graph. Plot the deviations $\hat{Y}_i - \bar{Y}$ against $X_i$
on another graph, using the same scales as in the first graph. From your two graphs, does
SSE or SSR appear to be the larger component of SSTO? What does this imply about the
magnitude of $R^2$?

b. Set up the ANOVA table.

c. Test whether or not $\beta_1 = 0$ using an F test with $\alpha = .05$. State the alternatives, decision
rule, and conclusion.

d. What proportion of the total variation in muscle mass remains “unexplained” when age is
introduced into the analysis? Is this proportion relatively small or large?

e. Obtain $R^2$ and r.

2.30. Refer to Crime rate Problem 1.28.

a. Test whether or not there is a linear association between crime rate and percentage of high
school graduates, using a t test with $\alpha = .01$. State the alternatives, decision rule, and
conclusion. What is the P-value of the test?

b. Estimate $\beta_1$ with a 99 percent confidence interval. Interpret your interval estimate.

2.31. Refer to Crime rate Problem 1.28.

a. Set up the ANOVA table.

b. Carry out the test in Problem 2.30a by means of the F test. Show the numerical equivalence
of the two test statistics and decision rules. Is the P-value for the F test the same as that for
the t test?

c. By how much is the total variation in crime rate reduced when percentage of high school
graduates is introduced into the analysis? Is this a relatively large or small reduction?

d. Obtain r. 

2.32. Refer to Crime rate Problems 1.28 and 2.30. Suppose that the test in Problem 2.30a is to be 

carried out by means of a general linear test. 

a. State the full and reduced models. 

b. Obtain (1) SSE(F), (2) SSE(R), (3) $df_F$, (4) $df_R$, (5) test statistic $F^*$ for the general linear
test, (6) decision rule.

c. Are the test statistic F* and the decision rule for the general linear test numerically equivalent 
to those in Problem 2.30a? 



2.33. In developing empirically a cost function from observed data on a complex chemical experiment,
an analyst employed normal error regression model (2.1). $\beta_0$ was interpreted here as the cost
of setting up the experiment. The analyst hypothesized that this cost should be $7.5 thousand
and wished to test the hypothesis by means of a general linear test.

a. Indicate the alternative conclusions for the test. 

b. Specify the full and reduced models. 

c. Without additional information, can you tell what the quantity $df_R - df_F$ in test statistic (2.70)
will equal in the analyst’s test? Explain.

2.34. Refer to Grade point average Problem 1.19.

a. Would it be more reasonable to consider the $X_i$ as known constants or as random variables
here? Explain.

b. If the $X_i$ were considered to be random variables, would this have any effect on prediction
intervals for new applicants? Explain.

2.35. Refer to Copier maintenance Problems 1.20 and 2.5. How would the meaning of the confidence
coefficient in Problem 2.5a change if the predictor variable were considered a random variable
and the conditions on page 83 were applicable?

2.36. A management trainee in a production department wished to study the relation between weight
of rough casting and machining time to produce the finished block. The trainee selected castings
so that the weights would be spaced equally apart in the sample and then observed the
corresponding machining times. Would you recommend that a regression or a correlation model be
used? Explain.

2.37. A social scientist stated: “The conditions for the bivariate normal distribution are so rarely met
in my experience that I feel much safer using a regression model.” Comment.

2.38. A student was investigating from a large sample whether variables $Y_1$ and $Y_2$ follow a bivariate
normal distribution. The student obtained the residuals when regressing $Y_1$ on $Y_2$, and also
obtained the residuals when regressing $Y_2$ on $Y_1$, and then prepared a normal probability plot
for each set of residuals. Do these two normal probability plots provide sufficient information
for determining whether the two variables follow a bivariate normal distribution? Explain.

2.39. For the bivariate normal distribution with parameters $\mu_1 = 50$, $\mu_2 = 100$, $\sigma_1 = 3$, $\sigma_2 = 4$, and
$\rho_{12} = .80$:

a. State the characteristics of the marginal distribution of $Y_1$.

b. State the characteristics of the conditional distribution of $Y_2$ when $Y_1 = 55$.

c. State the characteristics of the conditional distribution of $Y_1$ when $Y_2 = 95$.

2.40. Explain whether any of the following would be affected if the bivariate normal model (2.74) 
were employed instead of the normal error regression model (2.1) with fixed levels of the 
predictor variable: (1) point estimates of the regression coefficients, (2) confidence limits for 
the regression coefficients, (3) interpretation of the confidence coefficient. 

2.41. Refer to Plastic hardness Problem 1.22. A student was analyzing these data and received the
following standard query from the interactive regression and correlation computer package:
CALCULATE CONFIDENCE INTERVAL FOR POPULATION CORRELATION COEFFICIENT
RHO? ANSWER Y OR N. Would a “yes” response lead to meaningful information
here? Explain.

*2.42. Property assessments. The data that follow show assessed value for property tax purposes
($Y_1$, in thousand dollars) and sales price ($Y_2$, in thousand dollars) for a sample of 15 parcels
of land for industrial development sold recently in “arm’s length” transactions in a tax district.
Assume that bivariate normal model (2.74) is appropriate here.

   i:          1      2      3    ...    13     14     15

   Y_{i1}:   13.9   16.0   10.3   ...   14.9   12.9   15.8

   Y_{i2}:   28.6   34.7   21.0   ...   35.1   30.0   36.2

a. Plot the data in a scatter diagram. Does the bivariate normal model appear to be appropriate 
here? Discuss. 

b. Calculate $r_{12}$. What parameter is estimated by $r_{12}$? What is the interpretation of this
parameter?

c. Test whether or not $Y_1$ and $Y_2$ are statistically independent in the population, using test
statistic (2.87) and level of significance .01. State the alternatives, decision rule, and conclusion.

d. To test $\rho_{12} = .6$ versus $\rho_{12} \neq .6$, would it be appropriate to use test statistic (2.87)?

2.43. Contract profitability. A cost analyst for a drilling and blasting contractor examined 84
contracts handled in the last two years and found that the coefficient of correlation between value
of contract ($Y_1$) and profit contribution generated by the contract ($Y_2$) is $r_{12} = .61$. Assume
that bivariate normal model (2.74) applies.

a. Test whether or not $Y_1$ and $Y_2$ are statistically independent in the population; use $\alpha = .05$.
State the alternatives, decision rule, and conclusion.

b. Estimate $\rho_{12}$ with a 95 percent confidence interval.

c. Convert the confidence interval in part (b) to a 95 percent confidence interval for $\rho_{12}^2$. Interpret
this interval estimate.

*2.44. Bid preparation. A building construction consultant studied the relationship between cost of
bid preparation ($Y_1$) and amount of bid ($Y_2$) for the consulting firm’s clients. In a sample of
103 bids prepared by clients, $r_{12} = .87$. Assume that bivariate normal model (2.74) applies.

a. Test whether or not $\rho_{12} = 0$; control the risk of Type I error at .10. State the alternatives,
decision rule, and conclusion. What would be the implication if $\rho_{12} = 0$?

b. Obtain a 90 percent confidence interval for $\rho_{12}$. Interpret this interval estimate.

c. Convert the confidence interval in part (b) to a 90 percent confidence interval for $\rho_{12}^2$.

2.45. Water flow. An engineer, desiring to estimate the coefficient of correlation $\rho_{12}$ between rate
of water flow at point A in a stream ($Y_1$) and concurrent rate of flow at point B ($Y_2$), obtained
$r_{12} = .83$ in a sample of 147 cases. Assume that bivariate normal model (2.74) is appropriate.

a. Obtain a 99 percent confidence interval for $\rho_{12}$.

b. Convert the confidence interval in part (a) to a 99 percent confidence interval for $\rho_{12}^2$.

2.46. Refer to Property assessments Problem 2.42. There is some question as to whether or not 
bivariate model (2.74) is appropriate. 

a. Obtain the Spearman rank correlation coefficient $r_S$.

b. Test by means of the Spearman rank correlation coefficient whether an association exists
between property assessments and sales prices using test statistic (2.101) with $\alpha = .01$.
State the alternatives, decision rule, and conclusion.

c. How do your estimates and conclusions in parts (a) and (b) compare to those obtained in 
Problem 2.42? 


*2.47. Refer to Muscle mass Problem 1.27. Assume that the normal bivariate model (2.74) is
appropriate.

a. Compute the Pearson product-moment correlation coefficient $r_{12}$.

b. Test whether muscle mass and age are statistically independent in the population; use
$\alpha = .05$. State the alternatives, decision rule, and conclusion.

c. The bivariate normal model (2.74) assumption is possibly inappropriate here. Compute the
Spearman rank correlation coefficient, $r_S$.

d. Repeat part (b), this time basing the test of independence on the Spearman rank correlation
computed in part (c) and test statistic (2.101). Use $\alpha = .05$. State the alternatives, decision
rule, and conclusion.

e. How do your estimates and conclusions in parts (a) and (b) compare to those obtained in
parts (c) and (d)?


2.48. Refer to Crime rate Problems 1.28, 2.30, and 2.31. Assume that the normal bivariate model
(2.74) is appropriate.

a. Compute the Pearson product-moment correlation coefficient $r_{12}$.

b. Test whether crime rate and percentage of high school graduates are statistically independent
in the population; use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

c. How do your estimates and conclusions in parts (a) and (b) compare to those obtained in
2.31b and 2.30a, respectively?

2.49. Refer to Crime rate Problems 1.28 and 2.48. The bivariate normal model (2.74) assumption
is possibly inappropriate here.

a. Compute the Spearman rank correlation coefficient $r_S$.

b. Test by means of the Spearman rank correlation coefficient whether an association exists
between crime rate and percentage of high school graduates using test statistic (2.101) and
a level of significance .01. State the alternatives, decision rule, and conclusion.

c. How do your estimates and conclusions in parts (a) and (b) compare to those obtained in
Problems 2.48a and 2.48b, respectively?


Exercises 


2.50. Derive the property in (2.6) for the $k_i$.

2.51. Show that $b_0$ as defined in (2.21) is an unbiased estimator of $\beta_0$.

2.52. Derive the expression in (2.22b) for the variance of $b_0$, making use of (2.31). Also explain how
variance (2.22b) is a special case of variance (2.29b).

2.53. (Calculus needed.)

a. Obtain the likelihood function for the sample observations $Y_1, \ldots, Y_n$ given $X_1, \ldots, X_n$, if
the conditions on page 83 apply.

b. Obtain the maximum likelihood estimators of $\beta_0$, $\beta_1$, and $\sigma^2$. Are the estimators of $\beta_0$ and
$\beta_1$ the same as those in (1.27) when the $X_i$ are fixed?

2.54. Suppose that normal error regression model (2.1) is applicable except that the error variance
is not constant; rather the variance is larger, the larger is X. Does $\beta_1 = 0$ still imply that there
is no linear association between X and Y? That there is no association between X and Y?
Explain.

2.55. Derive the expression for SSR in (2.51). 

2.56. In a small-scale regression study, five observations on Y were obtained corresponding to X = 1,
4, 10, 11, and 14. Assume that $\sigma = .6$, $\beta_0 = 5$, and $\beta_1 = 3$.

a. What are the expected values of MSR and MSE here?

b. For determining whether or not a regression relation exists, would it have been better or
worse to have made the five observations at X = 6, 7, 8, 9, and 10? Why? Would the
same answer apply if the principal purpose were to estimate the mean response for X = 8?
Discuss.




2.57. The normal error regression model (2.1) is assumed to be applicable.

a. When testing $H_0$: $\beta_1 = 5$ versus $H_a$: $\beta_1 \neq 5$ by means of a general linear test, what is the
reduced model? What are the degrees of freedom $df_R$?

b. When testing $H_0$: $\beta_0 = 2$, $\beta_1 = 5$ versus $H_a$: not both $\beta_0 = 2$ and $\beta_1 = 5$ by means of a
general linear test, what is the reduced model? What are the degrees of freedom $df_R$?

2.58. The random variables $Y_1$ and $Y_2$ follow the bivariate normal distribution in (2.74). Show that if
$\rho_{12} = 0$, $Y_1$ and $Y_2$ are independent random variables.

2.59. (Calculus needed.)

a. Obtain the maximum likelihood estimators of the parameters of the bivariate normal
distribution in (2.74).

b. Using the results in part (a), obtain the maximum likelihood estimators of the parameters of
the conditional probability distribution of $Y_1$ for any value of $Y_2$ in (2.80).

c. Show that the maximum likelihood estimators of $\alpha_{1|2}$ and $\beta_{12}$ obtained in part (b) are the
same as the least squares estimators (1.10) for the regression coefficients in the simple linear
regression model.

2.60. Show that test statistics (2.17) and (2.87) are equivalent.

2.61. Show that the ratio SSR/SSTO is the same whether $Y_1$ is regressed on $Y_2$ or $Y_2$ is regressed on
$Y_1$. [Hint: Use (1.10a) and (2.51).]


Projects 


2.62. Refer to the CDI data set in Appendix C.2 and Project 1.43. Using $R^2$ as the criterion, which
predictor variable accounts for the largest reduction in the variability in the number of active
physicians?

2.63. Refer to the CDI data set in Appendix C.2 and Project 1.44. Obtain a separate interval estimate
of $\beta_1$ for each region. Use a 90 percent confidence coefficient in each case. Do the regression
lines for the different regions appear to have similar slopes?

2.64. Refer to the SENIC data set in Appendix C.1 and Project 1.45. Using $R^2$ as the criterion, which
predictor variable accounts for the largest reduction in the variability of the average length of
stay?

2.65. Refer to the SENIC data set in Appendix C.1 and Project 1.46. Obtain a separate interval
estimate of $\beta_1$ for each region. Use a 95 percent confidence coefficient in each case. Do the
regression lines for the different regions appear to have similar slopes?

2.66. Five observations on Y are to be taken when X = 4, 8, 12, 16, and 20, respectively. The true
regression function is $E\{Y\} = 20 + 4X$, and the $\varepsilon_i$ are independent N(0, 25).

a. Generate five normal random numbers, with mean 0 and variance 25. Consider these random
numbers as the error terms for the five Y observations at X = 4, 8, 12, 16, and 20 and calculate
$Y_1$, $Y_2$, $Y_3$, $Y_4$, and $Y_5$. Obtain the least squares estimates $b_0$ and $b_1$ when fitting a straight
line to the five cases. Also calculate $\hat{Y}_h$ when $X_h = 10$ and obtain a 95 percent confidence
interval for $E\{Y_h\}$ when $X_h = 10$.

b. Repeat part (a) 200 times, generating new random numbers each time.

c. Make a frequency distribution of the 200 estimates $b_1$. Calculate the mean and standard
deviation of the 200 estimates $b_1$. Are the results consistent with theoretical expectations?

d. What proportion of the 200 confidence intervals for $E\{Y_h\}$ when $X_h = 10$ include $E\{Y_h\}$?
Is this result consistent with theoretical expectations?






2.67. Refer to Grade point average Problem 1.19. 

a. Plot the data, with the least squares regression line for ACT scores between 20 and 30 
superimposed. 

b. On the plot in part (a), superimpose a plot of the 95 percent confidence band for the true 
regression line for ACT scores between 20 and 30. Does the confidence band suggest that 
the true regression relation has been precisely estimated? Discuss. 

2.68. Refer to Copier maintenance Problem 1.20. 

a. Plot the data, with the least squares regression line for numbers of copiers serviced between 
1 and 8 superimposed. 

b. On the plot in part (a), superimpose a plot of the 90 percent confidence band for the true 
regression line for numbers of copiers serviced between 1 and 8. Does the confidence band 
suggest that the true regression relation has been precisely estimated? Discuss. 



Chapter 3

Diagnostics and Remedial Measures

When a regression model, such as the simple linear regression model (2.1), is considered 
for an application, we can usually not be certain in advance that the model is appropriate 
for that application. Any one, or several, of the features of the model, such as linearity 
of the regression function or normality of the error terms, may not be appropriate for the
particular data at hand. Hence, it is important to examine the aptness of the model for the 
data before inferences based on that model are undertaken. In this chapter, we discuss some 
simple graphic methods for studying the appropriateness of a model, as well as some formal 
statistical tests for doing so. We also consider some remedial techniques that can be helpful 
when the data are not in accordance with the conditions of regression model (2.1). We 
conclude the chapter with a case example that brings together the concepts and methods 
presented in this and the earlier chapters. 

While the discussion in this chapter is in terms of the appropriateness of the simple 
linear regression model (2.1), the basic principles apply to all statistical models discussed 
in this book. In later chapters, additional methods useful for examining the appropriateness
of statistical models and other remedial measures will be presented, as well as methods for 
validating the statistical model. 


3.1 Diagnostics for Predictor Variable

We begin by considering some graphic diagnostics for the predictor variable. We need 
diagnostic information about the predictor variable to see if there are any outlying X values 
that could influence the appropriateness of the fitted regression function. We discuss the 
role of influential cases in detail in Chapter 10. Diagnostic information about the range and 
concentration of the X levels in the study is also useful for ascertaining the range of validity 
for the regression analysis. 

Figure 3.1a contains a simple dot plot for the lot sizes in the Toluca Company example
in Figure 1.10. A dot plot is helpful when the number of observations in the data set is not
large. The dot plot in Figure 3.1a shows that the minimum and maximum lot sizes are 20
and 120, respectively, that the lot size levels are spread throughout this interval, and that
there are no lot sizes that are far outlying. The dot plot also shows that in a number of cases
several runs were made for the same lot size.

FIGURE 3.1 MINITAB and SYGRAPH Diagnostic Plots for Predictor Variable: Toluca Company Example.
(a) Dot Plot (lot size)   (b) Sequence Plot (lot size against run)
(c) Stem-and-Leaf Plot    (d) Box Plot (lot size)
[Plots not reproduced. The stem-and-leaf display in panel (c) reads as follows.]

     2    0
     3    000
     4    00
     5H   000
     6    0
     7M   000
     8    000
     9H   0000
    10    00
    11    00
    12    0

A second useful diagnostic for the predictor variable is a sequence plot. Figure 3.1b
contains a time sequence plot of the lot sizes for the Toluca Company example. Lot size is
here plotted against production run (i.e., against time sequence). The points in the plot are
connected to show more effectively the time sequence. Sequence plots should be utilized
whenever data are obtained in a sequence, such as over time or for adjacent geographic
areas. The sequence plot in Figure 3.1b contains no special pattern. If, say, the plot had
shown that smaller lot sizes had been utilized early on and larger lot sizes later on, this
information could be very helpful for subsequent diagnostic studies of the aptness of the
fitted regression model.

Figures 3.1c and 3.1d contain two other diagnostic plots that present information similar
to the dot plot in Figure 3.1a. The stem-and-leaf plot in Figure 3.1c provides information
similar to a frequency histogram. By displaying the last digits, this plot also indicates here
that all lot sizes in the Toluca Company example were multiples of 10. The letter M in the
SYGRAPH output denotes the stem where the median is located, and the letter H denotes
the stems where the first and third quartiles (hinges) are located.

The box plot in Figure 3.1d shows the minimum and maximum lot sizes, the first and
third quartiles, and the median lot size. We see that the middle half of the lot sizes range 
from 50 to 90, and that they are fairly symmetrically distributed because the median is 
located in the middle of the central box. A box plot is particularly helpful when there are 
many observations in the data set. 
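
The summary values that the box plot displays can be checked numerically. The sketch below is illustrative only: it rebuilds the 25 lot sizes from the stem-and-leaf display in Figure 3.1c (every leaf is 0, so each value is ten times its stem) and computes the five-number summary with Python's standard library. The "inclusive" quartile method is an assumption made here; other quartile conventions can give slightly different hinges for other data sets.

```python
import statistics

# Number of leaves on each stem of the stem-and-leaf display in Figure 3.1c.
# Every leaf is 0, so each lot size equals 10 times its stem.
stem_counts = {2: 1, 3: 3, 4: 2, 5: 3, 6: 1, 7: 3, 8: 3, 9: 4, 10: 2, 11: 2, 12: 1}
lot_sizes = sorted(10 * stem for stem, count in stem_counts.items() for _ in range(count))

# Quartiles by linear interpolation between order statistics ("inclusive" convention).
q1, median, q3 = statistics.quantiles(lot_sizes, n=4, method="inclusive")

summary = {
    "n": len(lot_sizes),
    "min": min(lot_sizes),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(lot_sizes),
}
print(summary)
```

For these data the quartiles fall on the stems marked H and M in the display, which agrees with the statement that the middle half of the lot sizes ranges from 50 to 90.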


3.2 Residuals 


Direct diagnostic plots for the response variable Y are ordinarily not too useful in regression
analysis because the values of the observations on the response variable are a function of 
the level of the predictor variable. Instead, diagnostics for the response variable are usually 
carried out indirectly through an examination of the residuals. 

The residual $e_i$, as defined in (1.16), is the difference between the observed value $Y_i$ and
the fitted value $\hat{Y}_i$:

$$e_i = Y_i - \hat{Y}_i \tag{3.1}$$

The residual may be regarded as the observed error, in distinction to the unknown true error
$\varepsilon_i$ in the regression model:

$$\varepsilon_i = Y_i - E\{Y_i\} \tag{3.2}$$

For regression model (2.1), the error terms $\varepsilon_i$ are assumed to be independent normal
random variables, with mean 0 and constant variance $\sigma^2$. If the model is appropriate for the
data at hand, the observed residuals $e_i$ should then reflect the properties assumed for the $\varepsilon_i$.
This is the basic idea underlying residual analysis, a highly useful means of examining the
aptness of a statistical model.


Properties of Residuals 

Mean. The mean of the n residuals $e_i$ for the simple linear regression model (2.1) is,
by (1.17):

$$\bar{e} = \frac{\sum e_i}{n} = 0 \tag{3.3}$$

where $\bar{e}$ denotes the mean of the residuals. Thus, since $\bar{e}$ is always 0, it provides no
information as to whether the true errors $\varepsilon_i$ have expected value $E\{\varepsilon_i\} = 0$.

Variance. The variance of the n residuals $e_i$ is defined as follows for regression
model (2.1):

$$s^2 = \frac{\sum (e_i - \bar{e})^2}{n - 2} = \frac{\sum e_i^2}{n - 2} = \frac{SSE}{n - 2} = MSE \tag{3.4}$$

If the model is appropriate, MSE is, as noted earlier, an unbiased estimator of the variance
of the error terms $\sigma^2$.

Nonindependence. The residuals $e_i$ are not independent random variables because they
involve the fitted values $\hat{Y}_i$, which are based on the same fitted regression function. As
a result, the residuals for regression model (2.1) are subject to two constraints. These
are constraint (1.17) that the sum of the $e_i$ must be 0 and constraint (1.19) that the
products $X_i e_i$ must sum to 0.

When the sample size is large in comparison to the number of parameters in the regression 
model, the dependency effect among the residuals is relatively unimportant and can be 
ignored for most purposes. 
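
The two constraints on the residuals can be verified numerically for any least squares fit. The following sketch is an illustration only: it uses the eight cases of the transit example in Table 3.1 (introduced later in this chapter), and the helper name `fit_line` is an assumption made for this example, not notation from the text.

```python
# Transit example (Table 3.1): maps distributed X and increase in ridership Y, both in thousands.
X = [80, 220, 140, 120, 180, 100, 200, 160]
Y = [0.60, 6.70, 5.30, 4.00, 6.55, 2.15, 6.60, 5.75]


def fit_line(x, y):
    """Return the least squares estimates (b0, b1) for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


b0, b1 = fit_line(X, Y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(X, Y)]

sum_e = sum(residuals)                                  # constraint (1.17)
sum_xe = sum(xi * ei for xi, ei in zip(X, residuals))   # constraint (1.19)

print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")
print(f"sum of e_i = {sum_e:.2e}, sum of X_i * e_i = {sum_xe:.2e}")
```

Both sums are zero up to floating-point rounding. Because of these two linear constraints, only n - 2 of the n residuals are free to vary, which is why the dependence matters less as n grows.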


Semistudentized Residuals 

At times, it is helpful to standardize the residuals for residual analysis. Since the standard
deviation of the error terms $\varepsilon_i$ is $\sigma$, which is estimated by $\sqrt{MSE}$, it is natural to consider
the following form of standardization:

$$e_i^* = \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}} \tag{3.5}$$

If $\sqrt{MSE}$ were an estimate of the standard deviation of the residual $e_i$, we would call $e_i^*$
a studentized residual. However, the standard deviation of $e_i$ is complex and varies for
the different residuals $e_i$, and $\sqrt{MSE}$ is only an approximation of the standard deviation
of $e_i$. Hence, we call the statistic $e_i^*$ in (3.5) a semistudentized residual. We shall take
up studentized residuals in Chapter 10. Both semistudentized residuals and studentized
residuals can be very helpful in identifying outlying observations.
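
As a hedged illustration of (3.4) and (3.5), the sketch below fits a straight line to the transit example data of Table 3.1 (given later in this chapter), computes MSE with n - 2 degrees of freedom, and divides each residual by $\sqrt{MSE}$. The variable names are choices made for this example only.

```python
import math

# Transit example (Table 3.1): maps distributed X and increase in ridership Y, both in thousands.
X = [80, 220, 140, 120, 180, 100, 200, 160]
Y = [0.60, 6.70, 5.30, 4.00, 6.55, 2.15, 6.60, 5.75]
n = len(X)

# Least squares fit of a straight line.
x_bar = sum(X) / n
y_bar = sum(Y) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar

# Residuals (3.1), MSE (3.4), and semistudentized residuals (3.5).
residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
mse = sum(e ** 2 for e in residuals) / (n - 2)
semistudentized = [e / math.sqrt(mse) for e in residuals]

for x, e, e_star in zip(X, residuals, semistudentized):
    print(f"X = {x:3d}   e = {e:6.2f}   e* = {e_star:6.2f}")
```

No division by a case-specific standard deviation is attempted here; that refinement is what distinguishes the studentized residuals of Chapter 10.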


Departures from Model to Be Studied by Residuals 

We shall consider the use of residuals for examining six important types of departures from 
the simple linear regression model (2.1) with normal errors: 

1. The regression function is not linear. 

2. The error terms do not have constant variance. 

3. The error terms are not independent. 

4. The model fits all but one or a few outlier observations. 

5. The error terms are not normally distributed. 

6. One or several important predictor variables have been omitted from the model. 


3.3 Diagnostics for Residuals

We take up now some informal diagnostic plots of residuals to provide information on 
whether any of the six types of departures from the simple linear regression model (2.1) 
just mentioned are present. The following plots of residuals (or semistudentized residuals) 
will be utilized here for this purpose:

1. Plot of residuals against predictor variable.

2. Plot of absolute or squared residuals against predictor variable.

3. Plot of residuals against fitted values.

4. Plot of residuals against time or other sequence.

5. Plots of residuals against omitted predictor variables. 

6. Box plot of residuals. 

7. Normal probability plot of residuals. 




Figure 3.2 contains, for the Toluca Company example, MINITAB and SYGRAPH plots
of the residuals in Table 1.2 against the predictor variable and against time, a box plot, and
a normal probability plot. All of these plots, as we shall see, support the appropriateness of
regression model (2.1) for the data.

We turn now to consider how residual analysis can be helpful in studying each of the six
departures from regression model (2.1).

FIGURE 3.2 MINITAB and SYGRAPH Diagnostic Residual Plots: Toluca Company Example.
(a) Residual Plot against X (residual against lot size)   (b) Sequence Plot (residual against run)
(c) Box Plot (residual)                                   (d) Normal Probability Plot (residual against expected)
[Plots not reproduced.]

Nonlinearity of Regression Function

Whether a linear regression function is appropriate for the data being analyzed can be
studied from a residual plot against the predictor variable or, equivalently, from a residual
plot against the fitted values. Nonlinearity of the regression function can also be studied
from a scatter plot, but this plot is not always as effective as a residual plot. Figure 3.3a
contains a scatter plot of the data and the fitted regression line for a study of the relation
between maps distributed and bus ridership in eight test cities. Here, X is the number of
bus transit maps distributed free to residents of the city at the beginning of the test period
and Y is the increase during the test period in average daily bus ridership during nonpeak
hours. The original data and fitted values are given in Table 3.1, columns 1, 2, and 3. The
plot suggests strongly that a linear regression function is not appropriate.

TABLE 3.1 Number of Maps Distributed and Increase in Ridership: Transit Example.

           (1)            (2)
           Increase in    Maps             (3)          (4)
           Ridership      Distributed      Fitted
   City    (thousands)    (thousands)      Value        Residual
    i      Y_i            X_i              \hat{Y}_i    Y_i - \hat{Y}_i = e_i

    1       .60            80              1.66         -1.06
    2      6.70           220              7.75         -1.05
    3      5.30           140              4.27          1.03
    4      4.00           120              3.40           .60
    5      6.55           180              6.01           .54
    6      2.15           100              2.53          -.38
    7      6.60           200              6.88          -.28
    8      5.75           160              5.14           .61

FIGURE 3.3 Scatter Plot and Residual Plot Illustrating Nonlinear Regression Function: Transit Example.
(a) Scatter Plot: increase in ridership (thousands) against maps distributed (thousands), with
    fitted line $\hat{Y} = -1.82 + .0435X$
(b) Residual Plot: residual against maps distributed (thousands)
[Plots not reproduced.]

Figure 3.3b presents a plot of the residuals, shown in Table 3.1, column 4, against the
predictor variable X. The lack of fit of the linear regression function is even more strongly
suggested by the residual plot against X in Figure 3.3b than by the scatter plot. Note that
the residuals depart from 0 in a systematic fashion; they are negative for smaller X values,
positive for medium-size X values, and negative again for large X values.

In this case, both Figures 3.3a and 3.3b point out the lack of linearity of the regression
function. In general, however, the residual plot is to be preferred, because it has some
important advantages over the scatter plot. First, the residual plot can easily be used for
examining other facets of the aptness of the model. Second, there are occasions when the
scaling of the scatter plot places the $Y_i$ observations close to the fitted values $\hat{Y}_i$, for instance,
when there is a steep slope. It then becomes more difficult to study the appropriateness of
a linear regression function from the scatter plot. A residual plot, on the other hand, can
clearly show any systematic pattern in the deviations around the fitted regression line under
these conditions.

FIGURE 3.4 Prototype Residual Plots.
[Four panels, (a) through (d), each showing residuals e scattered about the horizontal line e = 0;
plots not reproduced.]

Figure 3.4a shows a prototype situation of the residual plot against X when a linear 
regression model is appropriate. The residuals then fall within a horizontal band centered 
around 0, displaying no systematic tendencies to be positive and negative. This is the case 
in Figure 3.2a for the Toluca Company example. 

Figure 3.4b shows a prototype situation of a departure from the linear regression model 
that indicates the need for a curvilinear regression function. Here the residuals tend to vary
in a systematic fashion between being positive and negative. This is the case in Figure 3.3b 
for the transit example. A different type of departure from linearity would, of course, lead 
to a picture different from the prototype pattern in Figure 3.4b. 
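
The systematic pattern described for the transit example can also be seen without a plot by listing the signs of the residuals in order of increasing X. The sketch below is illustrative only and uses the rounded residuals from column 4 of Table 3.1.

```python
# Transit example (Table 3.1): maps distributed X (thousands) and residuals e from column 4.
X = [80, 220, 140, 120, 180, 100, 200, 160]
e = [-1.06, -1.05, 1.03, 0.60, 0.54, -0.38, -0.28, 0.61]

# Order the cases by X, then record the sign of each residual.
ordered = sorted(zip(X, e))
signs = "".join("+" if residual > 0 else "-" for _, residual in ordered)

for x, residual in ordered:
    print(f"X = {x:3d}   e = {residual:5.2f}")
print("sign pattern by increasing X:", signs)
```

The signs run negative, then positive, then negative again as X increases, which is the curvilinear prototype of Figure 3.4b rather than the patternless horizontal band of Figure 3.4a.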

Comment 

A plot of residuals against the fitted values $\hat{Y}$ provides information equivalent to a plot of residuals
against X for the simple linear regression model, and thus is not needed in addition to the residual plot
against X. The two plots provide the same information because the fitted values $\hat{Y}_i$ are a linear function
of the values $X_i$ for the predictor variable. Thus, only the X scale values, not the basic pattern of the
plotted points, are affected by whether the residual plot is against the $X_i$ or the $\hat{Y}_i$. For curvilinear
regression and multiple regression, on the other hand, separate plots of the residuals against the fitted
values and against the predictor variable(s) are usually helpful. ■



FIGURE 3.5 Residual Plots Illustrating Nonconstant Error Variance.
(a) Residual Plot against X   (b) Absolute Residual Plot against X
[Plots not reproduced.]

Nonconstancy of Error Variance 

Plots of the residuals against the predictor variable or against the fitted values are not only
helpful to study whether a linear regression function is appropriate but also to examine
whether the variance of the error terms is constant. Figure 3.5a shows a residual plot against
age for a study of the relation between diastolic blood pressure of healthy, adult women (Y)
and their age (X). The plot suggests that the older the woman is, the more spread out the
residuals are. Since the relation between blood pressure and age is positive, this suggests
that the error variance is larger for older women than for younger ones.

The prototype plot in Figure 3.4a exemplifies residual plots when the error term variance 
is constant. The residual plot in Figure 3.2a for the Toluca Company example is of this type, 
suggesting that the error terms have constant variance here. 

Figure 3.4c shows a prototype picture of residual plots when the error variance increases
with X. In many business, social science, and biological science applications, departures
from constancy of the error variance tend to be of the “megaphone” type shown in
Figure 3.4c, as in the blood pressure example in Figure 3.5a. One can also encounter error
variances decreasing with increasing levels of the predictor variable and occasionally
varying in some more complex fashion.

Plots of the absolute values of the residuals or of the squared residuals against the
predictor variable X or against the fitted values $\hat{Y}$ are also useful for diagnosing nonconstancy
of the error variance since the signs of the residuals are not meaningful for examining the
constancy of the error variance. These plots are especially useful when there are not many
cases in the data set because plotting of either the absolute or squared residuals places all of
the information on changing magnitudes of the residuals above the horizontal zero line so
that one can more readily see whether the magnitude of the residuals (irrespective of sign)
is changing with the level of X or $\hat{Y}$.

Figure 3.5b contains a plot of the absolute residuals against age for the blood pressure 
example. This plot shows more clearly that the residuals tend to be larger in absolute 
magnitude for older-aged women. 
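
A rough numerical companion to the absolute residual plot is to compare the average absolute residual at low and high levels of X. The sketch below is illustrative only and does not use the blood pressure data, which are not listed here: the data are hypothetical, constructed so that the error magnitude grows in proportion to X, and the split into a lower and an upper half of the X values is an ad hoc choice, not a formal test.

```python
# Hypothetical data: true line 2 + 3X with errors that alternate in sign and whose
# magnitude grows in proportion to X (a "megaphone" pattern by construction).
X = list(range(1, 21))
Y = [2 + 3 * x + ((-1) ** x) * 0.5 * x for x in X]
n = len(X)

# Least squares fit and residuals.
x_bar = sum(X) / n
y_bar = sum(Y) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
abs_residuals = [abs(y - (b0 + b1 * x)) for x, y in zip(X, Y)]

# X is already in increasing order, so the first half holds the smaller X levels.
half = n // 2
mean_abs_low = sum(abs_residuals[:half]) / half
mean_abs_high = sum(abs_residuals[half:]) / (n - half)

print(f"mean |e| for smaller X: {mean_abs_low:.2f}")
print(f"mean |e| for larger X:  {mean_abs_high:.2f}")
```

A clearly larger average for the upper half points in the same direction as an absolute residual plot that fans upward, though a plot shows the shape of the change in a way the two averages cannot.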



FIGURE 3.6 Residual Plot with Outlier.
[Plot not reproduced.]


Presence of Outliers 

Outliers are extreme observations. Residual outliers can be identified from residual plots
against X or $\hat{Y}$, as well as from box plots, stem-and-leaf plots, and dot plots of the
residuals. Plotting of semistudentized residuals is particularly helpful for distinguishing outlying
observations, since it then becomes easy to identify residuals that lie many standard
deviations from zero. A rough rule of thumb when the number of cases is large is to consider
semistudentized residuals with absolute value of four or more to be outliers. We shall take
up more refined procedures for identifying outliers in Chapter 10.
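
The rule of thumb can be applied mechanically once the semistudentized residuals (3.5) are available. The sketch below is illustrative only: the 30 cases are hypothetical, with one gross error planted at case 15, and the function name `semistudentized_residuals` is a choice made for this example.

```python
import math


def semistudentized_residuals(x, y):
    """Fit a straight line by least squares and return e_i / sqrt(MSE) for each case."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)
    return [e / math.sqrt(mse) for e in residuals]


# Hypothetical data: 30 cases near the line 10 + 2X with small alternating errors,
# plus a gross error of +20 added to case 15.
X = list(range(1, 31))
Y = [10 + 2 * x + (0.5 if x % 2 == 0 else -0.5) for x in X]
Y[14] += 20

e_star = semistudentized_residuals(X, Y)

# Flag cases whose semistudentized residual is four or more in absolute value.
flagged = [case for case, value in enumerate(e_star, start=1) if abs(value) >= 4]
print("flagged cases:", flagged)
```

Note that the outlier itself inflates MSE, so its semistudentized residual is smaller than the planted error would suggest. This is one reason the rule is stated for large numbers of cases and why more refined procedures are deferred to Chapter 10.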

The residual plot in Figure 3.6 presents semistudentized residuals and contains one 
outlier, which is circled. Note that this residual represents an observation almost six standard 
deviations from the fitted value. 

Outliers can create great difficulty. When we encounter one, our first suspicion is that 
the observation resulted from a mistake or other extraneous effect, and hence should be 
discarded. A major reason for discarding it is that under the least squares method, a fitted 
line may be pulled disproportionately toward an outlying observation because the sum of 
the squared deviations is minimized. This could cause a misleading fit if indeed the outlying 
observation resulted from a mistake or other extraneous cause. On the other hand, outliers 
may convey significant information, as when an outlier occurs because of an interaction 
with another predictor variable omitted from the model. A safe rule frequently suggested is 
to discard an outlier only if there is direct evidence that it represents an error in recording, 
a miscalculation, a malfunctioning of equipment, or a similar type of circumstance.

Comment 

When a linear regression model is fitted to a data set with a small number of cases and an outlier is 
present, the fitted regression can be so distorted by the outlier that the residual plot may improperly 
suggest a lack of fit of the linear regression model, in addition to flagging the outlier. Figure 3.7 
illustrates this situation. The scatter plot in Figure 3.7a presents a situation where all observations 
except the outlier fall around a straight-line statistical relationship. When a linear regression function 
is fitted to these data, the outlier causes such a shift in the fitted regression line as to lead to a systematic 
pattern of deviations from the fitted line for the other observations, suggesting a lack of fit of the linear 
regression function. This is shown by the residual plot in Figure 3.7b. ■ 

Nonindependence of Error Terms 

Whenever data are obtained in a time sequence or some other type of sequence, such as 
for adjacent geographic areas, it is a good idea to prepare a sequence plot of the residuals. 




Chapter 3 Diagnostics and Remedial Measures 109 


FIGURE 3.7 Distorting Effect on Residuals Caused by an Outlier When Remaining Data Follow Linear Regression.



FIGURE 3.8 Residual Time Sequence Plots Illustrating Nonindependence of Error Terms.

(a) Welding Example Trend Effect (residuals plotted against time order of weld, 1 to 13)
(b) Cyclical Nonindependence (residuals plotted against time order, 1 to 13)


The purpose of plotting the residuals against time or in some other type of sequence is to 
see if there is any correlation between error terms that are near each other in the sequence. 
Figure 3.8a contains a time sequence plot of the residuals in an experiment to study the 
relation between the diameter of a weld (X) and the shear strength of the weld (Y). An 
evident correlation between the error terms stands out. Negative residuals are associated 
mainly with the early trials, and positive residuals with the later trials. Apparently, some 
effect connected with time was present, such as learning by the welder or a gradual change 
in the welding equipment, so the shear strength tended to be greater in the later welds 
because of this effect. 

A prototype residual plot showing a time-related trend effect is presented in Figure 3.4d, 
which portrays a linear time-related trend effect, as in the welding example. It is sometimes 
useful to view the problem of nonindependence of the error terms as one in which an 
important variable (in this case, time) has been omitted from the model. We shall discuss 
this type of problem shortly. 





Another type of nonindependence of the error terms is illustrated in Figure 3.8b. Here 
the adjacent error terms are also related, but the resulting pattern is a cyclical one with no 
trend effect present. 

When the error terms are independent, we expect the residuals in a sequence plot to 
fluctuate in a more or less random pattern around the base line 0, such as the scattering 
shown in Figure 3.2b for the Toluca Company example. Lack of randomness can take the 
form of too much or too little alternation of points around the zero line. In practice, there is 
little concern with the former because it does not arise frequently. Too little alternation, in 
contrast, frequently occurs, as in the welding example in Figure 3.8a. 
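
The idea of "too little alternation" around the zero line can be made concrete by counting runs of residuals that share the same sign. The sketch below is only an informal aid, not the runs test itself; the function name `count_sign_runs` and both residual sequences are hypothetical and were made up for illustration.

```python
def count_sign_runs(residuals):
    """Count runs of consecutive residuals sharing the same sign (zeros are skipped)."""
    signs = [1 if e > 0 else -1 for e in residuals if e != 0]
    if not signs:
        return 0
    runs = 1
    for prev, curr in zip(signs, signs[1:]):
        if curr != prev:  # a sign change starts a new run
            runs += 1
    return runs


# Hypothetical residuals in time order (not data from the text).
trend_like = [-4.0, -3.1, -2.5, -0.8, 0.6, 1.9, 2.7, 3.5]    # negatives early, positives late
alternating = [1.0, -1.2, 0.8, -0.9, 1.1, -1.0, 0.7, -0.6]   # sign flips at every step

print(count_sign_runs(trend_like))
print(count_sign_runs(alternating))
```

A trend like the one in the welding example produces very few runs, whereas excessive alternation produces nearly as many runs as there are observations.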

Comment 

When the residuals are plotted against X, as in Figure 3.3b for the transit example, the scatter may not 
appear to be random. For this plot, however, the basic problem is probably not lack of independence 
of the error terms but a poorly fitting regression function. This, indeed, is the situation portrayed in 
the scatter plot in Figure 3.3a. ■ 

Nonnormality of Error Terms 

As we noted earlier, small departures from normality do not create any serious problems. 
Major departures, on the other hand, should be of concern. The normality of the error terms 
can be studied informally by examining the residuals in a variety of graphic ways. 

Distribution Plots. A box plot of the residuals is helpful for obtaining summary information 
about the symmetry of the residuals and about possible outliers. Figure 3.2c contains 
a box plot of the residuals in the Toluca Company example. No serious departures from 
symmetry are suggested by this plot. A histogram, dot plot, or stem-and-leaf plot of the 
residuals can also be helpful for detecting gross departures from normality. However, the 
number of cases in the regression study must be reasonably large for any of these plots to 
convey reliable information about the shape of the distribution of the error terms. 

Comparison of Frequencies. Another possibility when the number of cases is reasonably 
large is to compare actual frequencies of the residuals against expected frequencies under 
normality. For example, one can determine whether, say, about 68 percent of the residuals 
$e_i$ fall between $\pm\sqrt{MSE}$ or about 90 percent fall between $\pm 1.645\sqrt{MSE}$. When the sample 
size is moderately large, corresponding t values may be used for the comparison. 

To illustrate this procedure, we again consider the Toluca Company example of Chapter 1. 
Table 3.2, column 1, repeats the residuals from Table 1.2. We see from Figure 2.2 that 
$\sqrt{MSE} = 48.82$. Using the t distribution, we expect under normality about 90 percent of 
the residuals to fall between $\pm t(.95; 23)\sqrt{MSE} = \pm 1.714(48.82)$, or between -83.68 
and 83.68. Actually, 22 residuals, or 88 percent, fall within these limits. Similarly, under 
normality, we expect about 60 percent of the residuals to fall between -41.89 and 41.89. 
The actual percentage here is 52 percent. Thus, the actual frequencies here are reasonably 
consistent with those expected under normality. 
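
The comparison of frequencies can be sketched in a few lines. The t multipliers 1.714 and 0.858 are read from Table B.2 for 23 degrees of freedom, and 48.82 is the value of √MSE quoted above. The function name `share_within` is hypothetical, and the list `partial_residuals` holds only the 11 Toluca residuals that are printed in this chapter's tables, so the shares it yields are illustrative and will not reproduce the 88 and 52 percent obtained from all 25 residuals.

```python
ROOT_MSE = 48.82   # sqrt(MSE) for the Toluca Company example, as quoted in the text
T_95_23 = 1.714    # t(.95; 23) from Table B.2
T_80_23 = 0.858    # t(.80; 23) from Table B.2


def share_within(residuals, multiplier, root_mse):
    """Fraction of residuals lying within +/- multiplier * sqrt(MSE)."""
    limit = multiplier * root_mse
    inside = sum(1 for e in residuals if abs(e) <= limit)
    return inside / len(residuals)


limit_90 = T_95_23 * ROOT_MSE   # limits expected to contain about 90 percent
limit_60 = T_80_23 * ROOT_MSE   # limits expected to contain about 60 percent
print(round(limit_90, 2), round(limit_60, 2))

# Only the residuals printed in Tables 3.2 and 3.3 (11 of the 25 runs).
partial_residuals = [51.02, -48.47, -19.88, 38.83, -5.98, 10.72,
                     -20.77, -60.28, 4.02, -34.09, 55.21]
print(share_within(partial_residuals, T_95_23, ROOT_MSE))
print(share_within(partial_residuals, T_80_23, ROOT_MSE))
```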

Normal Probability Plot. Still another possibility is to prepare a normal probability plot 
of the residuals. Here each residual is plotted against its expected value under normality. 
A plot that is nearly linear suggests agreement with normality, whereas a plot that departs 
substantially from linearity suggests that the error distribution is not normal. 

Table 3.2, column 1, contains the residuals for the Toluca Company example. To find 
the expected values of the ordered residuals under normality, we utilize the facts that (1) 





TABLE 3.2 Residuals and Expected Values under Normality: Toluca Company Example.

            (1)         (2)       (3)
                                  Expected
   Run      Residual    Rank      Value under
   i        e_i         k         Normality

    1        51.02      22         51.95
    2       -48.47       5        -44.10
    3       -19.88      10        -14.76
   ...        ...       ...         ...
   23        38.83      19         31.05
   24        -5.98      13          0
   25        10.72      17         19.93


the expected value of the error terms for regression model (2.1) is zero and (2) the standard 
deviation of the error terms is estimated by $\sqrt{MSE}$. Statistical theory has shown that for a 
normal random variable with mean 0 and estimated standard deviation $\sqrt{MSE}$, a good 
approximation of the expected value of the kth smallest observation in a random sample of n is: 

$$\sqrt{MSE}\left[z\left(\frac{k - .375}{n + .25}\right)\right] \qquad (3.6)$$

where z(A) as usual denotes the (A)100 percentile of the standard normal distribution. 

Using this approximation, let us calculate the expected values of the residuals under 
normality for the Toluca Company example. Column 2 of Table 3.2 shows the ranks of 
the residuals, with the smallest residual being assigned rank 1. We see that the rank of the 
residual for run 1, $e_1 = 51.02$, is 22, which indicates that this residual is the 22nd smallest 
among the 25 residuals. Hence, for this residual k = 22. We found earlier (Table 2.1) that 
MSE = 2,384. Hence: 

$$\frac{k - .375}{n + .25} = \frac{22 - .375}{25 + .25} = \frac{21.625}{25.25} = .8564$$

so that the expected value of this residual under normality is: 

$$\sqrt{2{,}384}\,[z(.8564)] = \sqrt{2{,}384}\,(1.064) = 51.95$$

Similarly, the expected value of the residual for run 2, $e_2 = -48.47$, is obtained by noting 
that the rank of this residual is k = 5; in other words, this residual is the fifth smallest one 
among the 25 residuals. Hence, we require (k - .375)/(n + .25) = (5 - .375)/(25 + .25) = 
.1832, so that the expected value of this residual under normality is: 

$$\sqrt{2{,}384}\,[z(.1832)] = \sqrt{2{,}384}\,(-.9032) = -44.10$$


Table 3.2, column 3, contains the expected values under the assumption of normality 
for a portion of the 25 residuals. Figure 3.2d presents a plot of the residuals against their 
expected values under normality. Note that the points in Figure 3.2d fall reasonably close to 
a straight line, suggesting that the distribution of the error terms does not depart substantially 
from a normal distribution. 
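
Approximation (3.6) is easy to evaluate directly. The following minimal sketch uses the inverse normal CDF from the Python standard library for z(A); the function name `expected_under_normality` is hypothetical. Small differences from the tabled values (a few hundredths) can arise because the text rounds z to three or four decimals.

```python
import math
from statistics import NormalDist


def expected_under_normality(k, n, mse):
    """Approximate expected value of the kth smallest of n residuals, per (3.6)."""
    percentile = (k - 0.375) / (n + 0.25)
    return math.sqrt(mse) * NormalDist().inv_cdf(percentile)


# Toluca Company example: n = 25 and MSE = 2,384; (run, rank) pairs from Table 3.2.
for run, rank in [(1, 22), (2, 5), (3, 10), (23, 19), (24, 13), (25, 17)]:
    print(run, rank, round(expected_under_normality(rank, 25, 2384.0), 2))
```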

Figure 3.9 shows three normal probability plots when the distribution of the error terms 
departs substantially from normality. Figure 3.9a shows a normal probability plot when 
the error term distribution is highly skewed to the right. Note the concave-upward shape 





FIGURE 3.9 Normal Probability Plots when Error Term Distribution Is Not Normal.

(a) Skewed Right
(b) Skewed Left
(c) Symmetrical with Heavy Tails

[Each panel plots the residuals against their expected values, with the horizontal axis running from -3 to 3.]


of the plot. Figure 3.9b shows a normal probability plot when the error term distribution 
is highly skewed to the left. Here, the pattern is concave downward. Finally, Figure 3.9c 
shows a normal probability plot when the distribution of the error terms is symmetrical but 
has heavy tails; in other words, the distribution has higher probabilities in the tails than 
a normal distribution. Note the concave-downward curvature in the plot at the left end, 
corresponding to the plot for a left-skewed distribution, and the concave-upward plot at the 
right end, corresponding to a right-skewed distribution. 

Comments 

1. Many computer packages will prepare normal probability plots, either automatically or at the 
option of the user. Some of these plots utilize semistudentized residuals, others omit the factor $\sqrt{MSE}$ 
in (3.6), but neither of these variations affects the nature of the plot. 

2. For continuous data, ties among the residuals should occur only rarely. If two residuals do have 
the same value, a simple procedure is to use the average rank for the tied residuals for calculating the 
corresponding expected values. ■ 

Difficulties in Assessing Normality. The analysis for model departures with respect to 
normality is, in many respects, more difficult than that for other types of departures. In the 
first place, random variation can be particularly mischievous when studying the nature of 
a probability distribution unless the sample size is quite large. Even worse, other types of 
departures can and do affect the distribution of the residuals. For instance, residuals may 
appear to be not normally distributed because an inappropriate regression function is used or 
because the error variance is not constant. Hence, it is usually a good strategy to investigate 
these other types of departures first, before concerning oneself with the normality of the 
error terms. 

Omission of Important Predictor Variables 

Residuals should also be plotted against variables omitted from the model that might have 
important effects on the response. The time variable cited earlier in the welding example is 


FIGURE 3.10 Residual Plots for Possible Omission of Important Predictor Variable: Productivity Example.

(a) Both Machines
(b) Company A Machines
(c) Company B Machines

[Each panel plots the residual e against Age of Worker (X).]


an illustration. The purpose of this additional analysis is to determine whether there are 
any other key variables that could provide important additional descriptive and predictive 
power to the model. 

As another example, in a study to predict output by piece-rate workers in an assembling 
operation, the relation between output (F) and age (X) of worker was studied for a sample 
of employees. The plot of the residuals against X, shown in Figure 3.10a, indicates no 
ground for suspecting the appropriateness of the linearity of the regression function or the 
constancy of the error variance. Since machines produced by two companies (A and B) are 
used in the assembling operation and could have an effect on output, residual plots against 
X by type of machine were undertaken and are shown in Figures 3.10b and 3.10c. Note 
that the residuals for Company A machines tend to be positive, while those for Company B 
machines tend to be negative. Thus, type of machine appears to have a definite effect on 
productivity, and output predictions may turn out to be far superior when this variable is 
added to the model. 




While this second example dealt with a qualitative variable (type of machine), the residual 
analysis for an additional quantitative variable is analogous. The residuals are plotted 
against the additional predictor variable to see whether or not the residuals tend to vary 
systematically with the level of the additional predictor variable. 

Comment 

We do not say that the original model is “wrong” when it can be improved materially by adding one or 
more predictor variables. Only a few of the factors operating on any response variable Y in real-world 
situations can be included explicitly in a regression model. The chief purpose of residual analysis in 
identifying other important predictor variables is therefore to test the adequacy of the model and see 
whether it could be improved materially by adding one or more predictor variables. ■ 

Some Final Comments 

1. We discussed model departures one at a time. In actuality, several types of departures 
may occur together. For instance, a linear regression function may be a poor fit and the 
variance of the error terms may not be constant. In these cases, the prototype patterns of 
Figure 3.4 can still be useful, but they would need to be combined into composite patterns. 

2. Although graphic analysis of residuals is only an informal method of analysis, in 
many cases it suffices for examining the aptness of a model. 

3. The basic approach to residual analysis explained here applies not only to simple 
linear regression but also to more complex regression and other types of statistical models. 

4. Several types of departures from the simple linear regression model have been identified 
by diagnostic tests of the residuals. Model misspecification due to either nonlinearity or 
the omission of important predictor variables tends to be serious, leading to biased estimates 
of the regression parameters and error variance. These problems are discussed further in 
Section 3.9 and Chapter 10. Nonconstancy of error variance tends to be less serious, leading 
to less efficient estimates and invalid error variance estimates. The problem is discussed in 
depth in Section 11.1. The presence of outliers can be serious for smaller data sets when 
their influence is large. Influential outliers are discussed further in Section 10.4. Finally, the 
nonindependence of error terms results in estimators that are unbiased but whose variances 
are seriously biased. Alternative estimation methods for correlated errors are discussed in 
Chapter 12. 

3.4 Overview of Tests Involving Residuals

Graphic analysis of residuals is inherently subjective. Nevertheless, subjective analysis of a 
variety of interrelated residual plots will frequently reveal difficulties with the model more 
clearly than particular formal tests. There are occasions, however, when one wishes to put 
specific questions to a test. We now briefly review some of the relevant tests. 

Most statistical tests require independent observations. As we have seen, however, the 
residuals are dependent. Fortunately, the dependencies become quite small for large samples, 
so that one can usually then ignore them. 

Tests for Randomness 

A runs test is frequently used to test for lack of randomness in the residuals arranged in time 
order. Another test, specifically designed for lack of randomness in least squares residuals, 
is the Durbin-Watson test. This test is discussed in Chapter 12. 





Tests for Constancy of Variance 

When a residual plot gives the impression that the variance may be increasing or decreasing 
in a systematic manner related to X or E{Y}, a simple test is based on the rank correlation 
between the absolute values of the residuals and the corresponding values of the predictor 
variable. Two other simple tests for constancy of the error variance — the Brown-Forsythe 
test and the Breusch-Pagan test—are discussed in Section 3.6. 

Tests for Outliers 

A simple test for identifying an outlier observation involves fitting a new regression line to 
the other n - 1 observations. The suspect observation, which was not used in fitting the new 
line, can now be regarded as a new observation. One can calculate the probability that in n 
observations, a deviation from the fitted line as great as that of the outlier will be obtained 
by chance. If this probability is sufficiently small, the outlier can be rejected as not having 
come from the same population as the other n - 1 observations. Otherwise, the outlier is 
retained. We discuss this approach in detail in Chapter 10. 

Many other tests to aid in evaluating outliers have been developed. These are discussed 
in specialized references, such as Reference 3.1. 

Tests for Normality 

Goodness of fit tests can be used for examining the normality of the error terms. For instance, 
the chi-square test or the Kolmogorov-Smirnov test and its modification, the Lilliefors test, 
can be employed for testing the normality of the error terms by analyzing the residuals. 
A simple test based on the normal probability plot of the residuals will be taken up in 
Section 3.5. 

Comment 

The runs test, rank correlation, and goodness of fit tests are commonly used statistical procedures and 
are discussed in many basic statistics texts. ■ 


3.5 Correlation Test for Normality

In addition to visually assessing the approximate linearity of the points plotted in a normal 
probability plot, a formal test for normality of the error terms can be conducted by 
calculating the coefficient of correlation (2.74) between the residuals $e_i$ and their expected 
values under normality. A high value of the correlation coefficient is indicative of normality. 
Table B.6, prepared by Looney and Gulledge (Ref. 3.2), contains critical values (percentiles) 
for various sample sizes for the distribution of the coefficient of correlation between the 
ordered residuals and their expected values under normality when the error terms are normally 
distributed. If the observed coefficient of correlation is at least as large as the tabled 
value, for a given $\alpha$ level, one can conclude that the error terms are reasonably normally 
distributed. 

Example

For the Toluca Company example in Table 3.2, the coefficient of correlation between the 
ordered residuals and their expected values under normality is .991. Controlling the $\alpha$ risk 
at .05, we find from Table B.6 that the critical value for n = 25 is .959. Since the observed 
coefficient exceeds this level, we have support for our earlier conclusion that the distribution 
of the error terms does not depart substantially from a normal distribution. 
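
The correlation underlying this test can be sketched as follows. The sketch omits the factor √MSE from (3.6), which does not change a correlation coefficient, and it ignores ties. The function names and the skewed residuals are hypothetical; the critical value must still be looked up in Table B.6, which is not reproduced here.

```python
import math
from statistics import NormalDist


def normal_scores(n):
    """z((k - .375)/(n + .25)) for k = 1, ..., n; the sqrt(MSE) factor is omitted."""
    nd = NormalDist()
    return [nd.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]


def pearson(u, v):
    """Ordinary coefficient of correlation between two equal-length lists."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def normality_correlation(residuals):
    """Correlation between the ordered residuals and their normal scores."""
    return pearson(sorted(residuals), normal_scores(len(residuals)))


# Hypothetical right-skewed residuals (not data from the text).
skewed = [-3.2, -2.9, -2.7, -2.4, -2.0, -1.5, -0.6, 0.9, 4.1, 10.3]
r_skewed = normality_correlation(skewed)
r_perfect = normality_correlation(normal_scores(10))  # residuals equal to their scores
print(round(r_skewed, 3), round(r_perfect, 3))

# Decision for the Toluca example, using the values quoted in the text.
print(0.991 >= 0.959)
```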





Comment 

The correlation test for normality presented here is simpler than the Shapiro-Wilk test (Ref. 3.3), 
which can be viewed as being based approximately also on the coefficient of correlation between the 
ordered residuals and their expected values under normality. ■ 


3.6 Tests for Constancy of Error Variance

We present two formal tests for ascertaining whether the error terms have constant variance: 
the Brown-Forsythe test and the Breusch-Pagan test. 

Brown-Forsythe Test

The Brown-Forsythe test, a modification of the Levene test (Ref. 3.4), does not depend 
on normality of the error terms. Indeed, this test is robust against serious departures from 
normality, in the sense that the nominal significance level remains approximately correct 
when the error terms have equal variances even if the distribution of the error terms is 
far from normal. Yet the test is still relatively efficient when the error terms are normally 
distributed. The Brown-Forsythe test as described is applicable to simple linear regression 
when the variance of the error terms either increases or decreases with X, as illustrated in 
the prototype megaphone plot in Figure 3.4c. The sample size needs to be large enough so 
that the dependencies among the residuals can be ignored. 

The test is based on the variability of the residuals. The larger the error variance, the 
larger the variability of the residuals will tend to be. To conduct the Brown-Forsythe test, we 
divide the data set into two groups, according to the level of X, so that one group consists 
of cases where the X level is comparatively low and the other group consists of cases where 
the X level is comparatively high. If the error variance is either increasing or decreasing 
with X, the residuals in one group will tend to be more variable than those in the other 
group. Equivalently, the absolute deviations of the residuals around their group mean will 
tend to be larger for one group than for the other group. In order to make the test more 
robust, we utilize the absolute deviations of the residuals around the median for the group 
(Ref. 3.5). The Brown-Forsythe test then consists simply of the two-sample t test based on 
test statistic (A.67) to determine whether the mean of the absolute deviations for one group 
differs significantly from the mean absolute deviation for the second group. 

Although the distribution of the absolute deviations of the residuals is usually not normal, 
it has been shown that the t* test statistic still follows approximately the t distribution when 
the variance of the error terms is constant and the sample sizes of the two groups are not 
extremely small. 

We shall now use $e_{i1}$ to denote the ith residual for group 1 and $e_{i2}$ to denote the ith 
residual for group 2. Also we shall use $n_1$ and $n_2$ to denote the sample sizes of the two 
groups, where: 

$$n = n_1 + n_2 \qquad (3.7)$$

Further, we shall use $\tilde{e}_1$ and $\tilde{e}_2$ to denote the medians of the residuals in the two groups. 
The Brown-Forsythe test uses the absolute deviations of the residuals around their group 
median, to be denoted by $d_{i1}$ and $d_{i2}$: 

$$d_{i1} = |e_{i1} - \tilde{e}_1| \qquad d_{i2} = |e_{i2} - \tilde{e}_2| \qquad (3.8)$$

With this notation, the two-sample t test statistic (A.67) becomes: 

$$t^*_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \qquad (3.9)$$

where $\bar{d}_1$ and $\bar{d}_2$ are the sample means of the $d_{i1}$ and $d_{i2}$, respectively, and the pooled variance 
$s^2$ in (A.63) becomes: 

$$s^2 = \frac{\sum (d_{i1} - \bar{d}_1)^2 + \sum (d_{i2} - \bar{d}_2)^2}{n - 2} \qquad (3.9a)$$

We denote the test statistic for the Brown-Forsythe test by $t^*_{BF}$. 

If the error terms have constant variance and $n_1$ and $n_2$ are not extremely small, $t^*_{BF}$ 
follows approximately the t distribution with n - 2 degrees of freedom. Large absolute 
values of $t^*_{BF}$ indicate that the error terms do not have constant variance. 

Example


We wish to use the Brown-Forsythe test for the Toluca Company example to determine 
whether or not the error term variance varies with the level of X. Since the X levels are 
spread fairly uniformly (see Figure 3.1a), we divide the 25 cases into two groups with 
approximately equal X ranges. The first group consists of the 13 runs with lot sizes from 
20 to 70. The second group consists of the 12 runs with lot sizes from 80 to 120. Table 3.3 


TABLE 3.3 Calculations for Brown-Forsythe Test for Constancy of Error Variance: Toluca Company Example.

Group 1

                  (1)      (2)         (3)       (4)
                  Lot      Residual
    i     Run     Size     e_i1        d_i1      (d_i1 - dbar_1)^2

    1     14       20      -20.77        .89      1,929.41
    2      2       30      -48.47      28.59        263.25
   ...    ...     ...        ...         ...           ...
   12     12       70      -60.28      40.40         19.49
   13     25       70       10.72      30.60        202.07

   Total                               582.60     12,566.6

   Median residual for group 1: -19.88        dbar_1 = 44.815

Group 2

                  (1)      (2)         (3)       (4)
                  Lot      Residual
    i     Run     Size     e_i2        d_i2      (d_i2 - dbar_2)^2

    1      1       80       51.02      53.70        637.56
    2      8       80        4.02       6.70        473.06
   ...    ...     ...        ...         ...           ...
   11     20      110      -34.09      31.41          8.76
   12      7      120       55.21      57.89        866.71

   Total                               341.40      9,610.2

   Median residual for group 2: -2.68         dbar_2 = 28.450



presents a portion of the data for each group. In columns 1 and 2 are repeated the lot sizes 
and residuals from Table 1.2. We see from Table 3.3 that the median residual is $\tilde{e}_1 = -19.88$ 
for group 1 and $\tilde{e}_2 = -2.68$ for group 2. Column 3 contains the absolute deviations of the 
residuals around their respective group medians. For instance, we obtain: 

$$d_{11} = |e_{11} - \tilde{e}_1| = |-20.77 - (-19.88)| = .89$$

$$d_{12} = |e_{12} - \tilde{e}_2| = |51.02 - (-2.68)| = 53.70$$

The means of the absolute deviations are obtained in the usual fashion: 

$$\bar{d}_1 = \frac{582.60}{13} = 44.815 \qquad \bar{d}_2 = \frac{341.40}{12} = 28.450$$

Finally, column 4 contains the squares of the deviations of the $d_{i1}$ and $d_{i2}$ around their 
respective group means. For instance, we have: 

$$(d_{11} - \bar{d}_1)^2 = (.89 - 44.815)^2 = 1{,}929.41$$

$$(d_{12} - \bar{d}_2)^2 = (53.70 - 28.450)^2 = 637.56$$

We are now ready to calculate test statistic (3.9): 

$$s^2 = \frac{12{,}566.6 + 9{,}610.2}{25 - 2} = 964.21 \qquad s = 31.05$$

$$t^*_{BF} = \frac{44.815 - 28.450}{31.05\sqrt{\dfrac{1}{13} + \dfrac{1}{12}}} = 1.32$$

To control the $\alpha$ risk at .05, we require t(.975; 23) = 2.069. The decision rule therefore is: 

If $|t^*_{BF}| \le 2.069$, conclude the error variance is constant 
If $|t^*_{BF}| > 2.069$, conclude the error variance is not constant 

Since $|t^*_{BF}| = 1.32 \le 2.069$, we conclude that the error variance is constant and does not 
vary with the level of X. The two-sided P-value of this test is .20. 
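
The arithmetic of (3.8), (3.9), and (3.9a) can be sketched as below. The function names `pooled_t` and `brown_forsythe_t` are hypothetical. Because only part of the Toluca residuals is printed in Table 3.3, the worked check uses the table's totals and means rather than the raw residuals; the two small groups passed to `brown_forsythe_t` are made-up values that merely exercise the code.

```python
import math
from statistics import mean, median


def pooled_t(dbar1, dbar2, ss1, ss2, n1, n2):
    """Two-sample t statistic (3.9) with pooled variance (3.9a)."""
    s2 = (ss1 + ss2) / (n1 + n2 - 2)
    return (dbar1 - dbar2) / (math.sqrt(s2) * math.sqrt(1 / n1 + 1 / n2))


def brown_forsythe_t(group1, group2):
    """t*_BF computed from the residuals of the low-X and high-X groups."""
    d1 = [abs(e - median(group1)) for e in group1]   # absolute deviations (3.8)
    d2 = [abs(e - median(group2)) for e in group2]
    dbar1, dbar2 = mean(d1), mean(d2)
    ss1 = sum((d - dbar1) ** 2 for d in d1)
    ss2 = sum((d - dbar2) ** 2 for d in d2)
    return pooled_t(dbar1, dbar2, ss1, ss2, len(d1), len(d2))


# Toluca Company example, using the totals and means shown in Table 3.3.
t_bf = pooled_t(44.815, 28.450, 12566.6, 9610.2, 13, 12)
print(round(t_bf, 2))
print(abs(t_bf) <= 2.069)   # compare with t(.975; 23) = 2.069

# Hypothetical residuals for two small groups.
t_demo = brown_forsythe_t([-2, -1, 0, 1, 2], [-4, -2, 0, 2, 4])
print(round(t_demo, 3))
```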

Comments 

1. If the data set contains many cases, the two-sample t test for constancy of error variance can 
be conducted after dividing the cases into three or four groups, according to the level of X, and using 
the two extreme groups. 

2. A robust test for constancy of the error variance is desirable because nonnormality and lack of 
constant variance often go hand in hand. For example, the distribution of the error terms may become 
increasingly skewed and hence more variable with increasing levels of X. ■ 

Breusch-Pagan Test 

A second test for the constancy of the error variance is the Breusch-Pagan test (Ref. 3.6). 
This test, a large-sample test, assumes that the error terms are independent and normally 
distributed and that the variance of the error term $\varepsilon_i$, denoted by $\sigma_i^2$, is related to the level 
of X in the following way: 

$$\log_e \sigma_i^2 = \gamma_0 + \gamma_1 X_i \qquad (3.10)$$

Note that (3.10) implies that $\sigma_i^2$ either increases or decreases with the level of X, depending 
on the sign of $\gamma_1$. Constancy of error variance corresponds to $\gamma_1 = 0$. The test of $H_0$: $\gamma_1 = 0$ 
versus $H_a$: $\gamma_1 \ne 0$ is carried out by means of regressing the squared residuals $e_i^2$ against $X_i$ 
in the usual manner and obtaining the regression sum of squares, to be denoted by SSR*. 
The test statistic $X^2_{BP}$ is as follows: 

$$X^2_{BP} = \frac{SSR^*}{2} \div \left(\frac{SSE}{n}\right)^2 \qquad (3.11)$$

where SSR* is the regression sum of squares when regressing $e^2$ on X and SSE is the error 
sum of squares when regressing Y on X. If $H_0$: $\gamma_1 = 0$ holds and n is reasonably large, 
$X^2_{BP}$ follows approximately the chi-square distribution with one degree of freedom. Large 
values of $X^2_{BP}$ lead to conclusion $H_a$, that the error variance is not constant. 


Example

To conduct the Breusch-Pagan test for the Toluca Company example, we regress the squared 
residuals in Table 1.2, column 5, against X and obtain SSR* = 7,896,128. We know from 
Figure 2.2 that SSE = 54,825. Hence, test statistic (3.11) is: 

$$X^2_{BP} = \frac{7{,}896{,}128}{2} \div \left(\frac{54{,}825}{25}\right)^2 = .821$$

To control the $\alpha$ risk at .05, we require $\chi^2(.95; 1) = 3.84$. Since $X^2_{BP} = .821 \le 3.84$, we 
conclude $H_0$, that the error variance is constant. The P-value of this test is .64 so that the 
data are quite consistent with constancy of the error variance. 
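
Test statistic (3.11) can be sketched as follows. The first function works from the two sums of squares, the second obtains SSR* by regressing the squared residuals on X. Both function names are hypothetical, and the tiny data set passed to `breusch_pagan_from_data` is made up solely to exercise the code; the critical value 3.84 is the one quoted in the text.

```python
def breusch_pagan_statistic(ssr_star, sse, n):
    """Test statistic (3.11): (SSR*/2) divided by (SSE/n) squared."""
    return (ssr_star / 2.0) / (sse / n) ** 2


def regression_ss(x, y):
    """Regression sum of squares for the simple linear regression of y on x."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy ** 2 / s_xx


def breusch_pagan_from_data(x, residuals):
    """Compute (3.11) from the X values and the residuals of the fitted regression."""
    squared = [e ** 2 for e in residuals]
    ssr_star = regression_ss(x, squared)   # regress e^2 on X
    sse = sum(squared)                     # SSE of the original fit
    return breusch_pagan_statistic(ssr_star, sse, len(x))


# Toluca Company example, using the sums of squares quoted in the text.
x2_bp = breusch_pagan_statistic(7_896_128, 54_825, 25)
print(round(x2_bp, 3))
print(x2_bp <= 3.84)   # compare with chi-square(.95; 1) = 3.84

# Hypothetical data whose squared residuals do not vary with X at all.
print(breusch_pagan_from_data([1, 2, 3, 4], [1.0, -1.0, 1.0, -1.0]))
```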


Comments 

1. The Breusch-Pagan test can be modified to allow for different relationships between the error 
variance and the level of X than the one in (3.10). 

2. Test statistic (3.11) was developed independently by Cook and Weisberg (Ref. 3.7), and the test is 
sometimes referred to as the Cook-Weisberg test. ■ 


3.7 F Test for Lack of Fit 


We next take up a formal test for determining whether a specific type of regression function 
adequately fits the data. We illustrate this test for ascertaining whether a linear regression 
function is a good fit for the data. 

Assumptions 

The lack of fit test assumes that the observations Y for given X are (1) independent and 
(2) normally distributed, and that (3) the distributions of Y have the same variance $\sigma^2$. 

The lack of fit test requires repeat observations at one or more X levels. In nonexperimental 
data, these may occur fortuitously, as when in a productivity study relating workers' 
output and age, several workers of the same age happen to be included in the study. In an 
experiment, one can assure by design that there are repeat observations. For instance, in an 




TABLE 3.4 Data and Analysis of Variance Table: Bank Example.

(a) Data

             Size of                                  Size of
             Minimum      Number                      Minimum      Number
             Deposit      of New                      Deposit      of New
   Branch    (dollars)    Accounts          Branch    (dollars)    Accounts
   i         X_i          Y_i               i         X_i          Y_i

   1         125          160                7         75           42
   2         100          112                8        175          124
   3         200          124                9        125          150
   4          75           28               10        200          104
   5         150          152               11        100          136
   6         175          156

(b) ANOVA Table

   Source of
   Variation       SS           df      MS

   Regression       5,141.3      1      5,141.3
   Error           14,741.6      9      1,638.0
   Total           19,882.9     10



experiment on the effect of size of salesperson bonus on sales, three salespersons can be 
offered a particular size of bonus, for each of six bonus sizes, and their sales then observed. 

Repeat trials for the same level of the predictor variable, of the type described, are called 
replications. The resulting observations are called replicates. 

In an experiment involving 12 similar but scattered suburban branch offices of a commercial 
bank, holders of checking accounts at the offices were offered gifts for setting up money 
market accounts. Minimum initial deposits in the new money market account were specified 
to qualify for the gift. The value of the gift was directly proportional to the specified 
minimum deposit. Various levels of minimum deposit and related gift values were used in 
the experiment in order to ascertain the relation between the specified minimum deposit 
and gift value, on the one hand, and number of accounts opened at the office, on the other. 
Altogether, six levels of minimum deposit and proportional gift value were used, with two 
of the branch offices assigned at random to each level. One branch office had a fire during 
the period and was dropped from the study. Table 3.4a contains the results, where X is the 
amount of minimum deposit and Y is the number of new money market accounts that were 
opened and qualified for the gift during the test period. 

A linear regression function was fitted in the usual fashion; it is: 

$$\hat{Y} = 50.72251 + .48670X$$

The analysis of variance table also was obtained and is shown in Table 3.4b. A scatter plot, 
together with the fitted regression line, is shown in Figure 3.11. The indications are strong 
that a linear regression function is inappropriate. To test this formally, we shall use the 
general linear test approach described in Section 2.8. 
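
As a check on the fitted line and the ANOVA table, the least squares computations for the bank example can be sketched as follows. The eleven (X, Y) pairs are those of the bank example; the variable names are hypothetical.

```python
# Minimum deposit (X) and number of new accounts (Y) for the 11 branches.
x = [125, 100, 200, 75, 150, 175, 75, 175, 125, 200, 100]
y = [160, 112, 124, 28, 152, 156, 42, 124, 150, 104, 136]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)

b1 = s_xy / s_xx             # slope
b0 = y_bar - b1 * x_bar      # intercept

fitted = [b0 + b1 * a for a in x]
sse = sum((b - f) ** 2 for b, f in zip(y, fitted))   # error sum of squares
ssto = sum((b - y_bar) ** 2 for b in y)              # total sum of squares
ssr = ssto - sse                                     # regression sum of squares
mse = sse / (n - 2)

print(round(b0, 5), round(b1, 5))
print(round(ssr, 1), round(sse, 1), round(ssto, 1), round(mse, 1))
```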


FIGURE 3.11 Scatter Plot and Fitted Regression Line: Bank Example.

[Horizontal axis: Size of Minimum Deposit.]

TABLE 3.5 Data Arranged by Replicate Number and Minimum Deposit: Bank Example.

                        Size of Minimum Deposit (dollars)

                 j = 1      j = 2      j = 3      j = 4      j = 5      j = 6
   Replicate     X_1 = 75   X_2 = 100  X_3 = 125  X_4 = 150  X_5 = 175  X_6 = 200

   i = 1          28        112        160        152        156        124
   i = 2          42        136        150                   124        104

   Mean Ybar_j    35        124        155        152        140        114

Notation

First, we need to modify our notation to recognize the existence of replications at some levels 
of X. Table 3.5 presents the same data as Table 3.4a, but in an arrangement that recognizes 
the replicates. We shall denote the different X levels in the study, whether or not replicated 
observations are present, as $X_1, \ldots, X_c$. For the bank example, c = 6 since there are six 
minimum deposit size levels in the study, for five of which there are two observations and 
for one there is a single observation. We shall let $X_1 = 75$ (the smallest minimum deposit 
level), $X_2 = 100, \ldots, X_6 = 200$. Further, we shall denote the number of replicates for the 
jth level of X as $n_j$; for our example, $n_1 = n_2 = n_3 = n_5 = n_6 = 2$ and $n_4 = 1$. Thus, the 
total number of observations n is given by: 

$$n = \sum_{j=1}^{c} n_j \qquad (3.12)$$

We shall denote the observed value of the response variable for the ith replicate for 
the jth level of X by $Y_{ij}$, where $i = 1, \ldots, n_j$, $j = 1, \ldots, c$. For the bank example 
(Table 3.5), $Y_{11} = 28$, $Y_{21} = 42$, $Y_{12} = 112$, and so on. Finally, we shall denote the 
mean of the Y observations at the level $X = X_j$ by $\bar{Y}_j$. Thus, $\bar{Y}_1 = (28 + 42)/2 = 35$ and 
$\bar{Y}_4 = 152/1 = 152$. 

Full Model 

The general linear test approach begins with the specification of the full model. The full 
model used for the lack of fit test makes the same assumptions as the simple linear regression 
model (2.1) except for assuming a linear regression relation, the subject of the test. This 
full model is: 


$$Y_{ij} = \mu_j + \varepsilon_{ij} \qquad \text{Full model} \tag{3.13}$$



2 Part One Simple Linear Regression 


where:

    $\mu_j$ are parameters, $j = 1, \ldots, c$
    $\varepsilon_{ij}$ are independent $N(0, \sigma^2)$

Since the error terms have expectation zero, it follows that:

$$E\{Y_{ij}\} = \mu_j \tag{3.14}$$

Thus, the parameter $\mu_j$ ($j = 1, \ldots, c$) is the mean response when $X = X_j$.

The full model (3.13) is like the regression model (2.1) in stating that each response
Y is made up of two components: the mean response when $X = X_j$ and a random error
term. The difference between the two models is that in the full model (3.13) there are no
restrictions on the means $\mu_j$, whereas in the regression model (2.1) the mean responses are
linearly related to X (i.e., $E\{Y\} = \beta_0 + \beta_1 X$).

To fit the full model to the data, we require the least squares or maximum likelihood
estimators for the parameters $\mu_j$. It can be shown that these estimators of $\mu_j$ are simply the
sample means $\bar{Y}_j$:

$$\hat{\mu}_j = \bar{Y}_j \tag{3.15}$$

Thus, the estimated expected value for observation $Y_{ij}$ is $\bar{Y}_j$, and the error sum of squares
for the full model therefore is:

$$SSE(F) = \sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2 = SSPE \tag{3.16}$$

In the context of the test for lack of fit, the full model error sum of squares (3.16) is called
the pure error sum of squares and is denoted by SSPE.

Note that SSPE is made up of the sums of squared deviations at each X level. At level
$X = X_j$, this sum of squared deviations is:

$$\sum_i (Y_{ij} - \bar{Y}_j)^2 \tag{3.17}$$

These sums of squares are then added over all of the X levels ($j = 1, \ldots, c$). For the bank
example, we have:

$$\begin{aligned}
SSPE &= (28 - 35)^2 + (42 - 35)^2 + (112 - 124)^2 + (136 - 124)^2 + (160 - 155)^2 \\
&\quad + (150 - 155)^2 + (152 - 152)^2 + (156 - 140)^2 + (124 - 140)^2 \\
&\quad + (124 - 114)^2 + (104 - 114)^2 \\
&= 1{,}148
\end{aligned}$$

Note that any X level with no replications makes no contribution to SSPE because $\bar{Y}_j = Y_{1j}$
then. Thus, $(152 - 152)^2 = 0$ for j = 4 in the bank example.

The degrees of freedom associated with SSPE can be obtained by recognizing that the
sum of squared deviations (3.17) at a given level of X is like an ordinary total sum of squares
based on n observations, which has n - 1 degrees of freedom associated with it. Here, there
are $n_j$ observations when $X = X_j$; hence the degrees of freedom are $n_j - 1$. Just as SSPE
is the sum of the sums of squares (3.17), so the number of degrees of freedom associated
with SSPE is the sum of the component degrees of freedom:

$$df_F = \sum_j (n_j - 1) = \sum_j n_j - c = n - c \tag{3.18}$$

For the bank example, we have $df_F = 11 - 6 = 5$. Note that any X level with no replications
makes no contribution to $df_F$ because $n_j - 1 = 1 - 1 = 0$ then, just as such an X level
makes no contribution to SSPE.
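The pure error computation in (3.16) through (3.18) can be sketched as a grouping operation: observations are grouped by X level, squared deviations from each group mean are summed, and the degrees of freedom are n - c. The helper name `pure_error_ss` is a hypothetical choice for illustration; the data are those of Table 3.5.

```python
from collections import defaultdict

# Bank example data as arranged in Table 3.5.
X = [75, 75, 100, 100, 125, 125, 150, 175, 175, 200, 200]
Y = [28, 42, 112, 136, 160, 150, 152, 156, 124, 124, 104]


def pure_error_ss(x, y):
    """Return (SSPE, df_F) for responses y grouped by the levels of x."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    sspe = 0.0
    for ys in groups.values():
        y_bar_j = sum(ys) / len(ys)  # estimate of mu_j, as in (3.15)
        sspe += sum((yi - y_bar_j) ** 2 for yi in ys)  # contribution (3.17)

    df_f = len(y) - len(groups)  # n - c, as in (3.18)
    return sspe, df_f


sspe, df_f = pure_error_ss(X, Y)
print(sspe, df_f)
```

An unreplicated level such as X = 150 contributes a single-element group, so it adds zero to both the sum of squares and the degrees of freedom, as the text notes.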


Reduced Model 

The general linear test approach next requires consideration of the reduced model under
$H_0$. For testing the appropriateness of a linear regression relation, the alternatives are:

$$\begin{aligned}
H_0 &: E\{Y\} = \beta_0 + \beta_1 X \\
H_a &: E\{Y\} \ne \beta_0 + \beta_1 X
\end{aligned} \tag{3.19}$$

Thus, $H_0$ postulates that $\mu_j$ in the full model (3.13) is linearly related to $X_j$:

$$\mu_j = \beta_0 + \beta_1 X_j$$

The reduced model under $H_0$ therefore is:

$$Y_{ij} = \beta_0 + \beta_1 X_j + \varepsilon_{ij} \qquad \text{Reduced model} \tag{3.20}$$

Note that the reduced model is the ordinary simple linear regression model (2.1), with the
subscripts modified to recognize the existence of replications. We know that the estimated
expected value for observation $Y_{ij}$ with regression model (2.1) is the fitted value $\hat{Y}_{ij}$:

$$\hat{Y}_{ij} = b_0 + b_1 X_j \tag{3.21}$$

Hence, the error sum of squares for the reduced model is the usual error sum of squares SSE:

$$SSE(R) = \sum_j \sum_i [Y_{ij} - (b_0 + b_1 X_j)]^2 = \sum_j \sum_i (Y_{ij} - \hat{Y}_{ij})^2 = SSE \tag{3.22}$$

We also know that the degrees of freedom associated with SSE(R) are:

$$df_R = n - 2$$

For the bank example, we have from Table 3.4b:

$$SSE(R) = SSE = 14{,}741.6 \qquad df_R = 9$$

Test Statistic

The general linear test statistic (2.70):

$$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F}$$

here becomes:

$$F^* = \frac{SSE - SSPE}{(n - 2) - (n - c)} \div \frac{SSPE}{n - c} \tag{3.23}$$





The difference between the two error sums of squares is called the lack of fit sum of squares 
here and is denoted by SSLF: 


$$SSLF = SSE - SSPE \tag{3.24}$$

We can then express the test statistic as follows:

$$F^* = \frac{SSLF}{c - 2} \div \frac{SSPE}{n - c} = \frac{MSLF}{MSPE} \tag{3.25}$$

where MSLF denotes the lack of fit mean square and MSPE denotes the pure error mean
square.

We know that large values of F* lead to conclusion $H_a$ in the general linear test. Decision
rule (2.71) here becomes:

$$\begin{aligned}
&\text{If } F^* \le F(1 - \alpha;\, c - 2,\, n - c), \text{ conclude } H_0 \\
&\text{If } F^* > F(1 - \alpha;\, c - 2,\, n - c), \text{ conclude } H_a
\end{aligned} \tag{3.26}$$

For the bank example, the test statistic can be constructed easily from our earlier results:

$$\begin{aligned}
SSPE &= 1{,}148.0 & n - c &= 11 - 6 = 5 \\
SSE &= 14{,}741.6 \\
SSLF &= 14{,}741.6 - 1{,}148.0 = 13{,}593.6 & c - 2 &= 6 - 2 = 4
\end{aligned}$$

$$F^* = \frac{13{,}593.6}{4} \div \frac{1{,}148.0}{5} = \frac{3{,}398.4}{229.6} = 14.80$$

If the level of significance is to be $\alpha = .01$, we require F(.99; 4, 5) = 11.4. Since
F* = 14.80 > 11.4, we conclude $H_a$, that the regression function is not linear. This, of
course, accords with our visual impression from Figure 3.11. The P-value for the test is
.006.
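The whole lack of fit test can be sketched in a few lines, under the assumption that the data are the 11 observations of Table 3.5. The function name `lack_of_fit_f` and the returned dictionary keys are illustrative; the critical value 11.4 is simply the F(.99; 4, 5) figure quoted in the text rather than something computed here.

```python
from collections import defaultdict

# Bank example data as arranged in Table 3.5.
X = [75, 75, 100, 100, 125, 125, 150, 175, 175, 200, 200]
Y = [28, 42, 112, 136, 160, 150, 152, 156, 124, 124, 104]


def lack_of_fit_f(x, y):
    """Return the pieces of the lack of fit F test for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar

    # Reduced model: usual error sum of squares SSE with n - 2 df.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    # Full model: pure error sum of squares SSPE with n - c df.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)
    sspe = sum((yi - sum(ys) / len(ys)) ** 2
               for ys in groups.values() for yi in ys)

    sslf = sse - sspe              # (3.24)
    mslf = sslf / (c - 2)
    mspe = sspe / (n - c)
    return {"SSE": sse, "SSPE": sspe, "SSLF": sslf,
            "MSLF": mslf, "MSPE": mspe, "F": mslf / mspe,
            "df": (c - 2, n - c)}


result = lack_of_fit_f(X, Y)
F_CRIT = 11.4  # F(.99; 4, 5), taken from the text
conclusion = "Ha" if result["F"] > F_CRIT else "H0"
print(round(result["F"], 2), result["df"], conclusion)
```

The sketch requires at least one replicated X level (so that n - c > 0) and more than two distinct levels (so that c - 2 > 0); it does not guard against either condition failing.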


ANOVA Table 

The definition of the lack of fit sum of squares SSLF in (3.24) indicates that we have, in 
fact, decomposed the error sum of squares SSE into two components: 

$$SSE = SSPE + SSLF \tag{3.27}$$

This decomposition follows from the identity:

$$\underbrace{Y_{ij} - \hat{Y}_{ij}}_{\text{Error deviation}} = \underbrace{Y_{ij} - \bar{Y}_j}_{\text{Pure error deviation}} + \underbrace{\bar{Y}_j - \hat{Y}_{ij}}_{\text{Lack of fit deviation}} \tag{3.28}$$

This identity shows that the error deviations in SSE are made up of a pure error component
and a lack of fit component. Figure 3.12 illustrates this partitioning for the case $Y_{13} = 160$,
$X_3 = 125$ in the bank example.



FIGURE 3.12 Illustration of Decomposition of Error Deviation $Y_{ij} - \hat{Y}_{ij}$: Bank Example. (Horizontal axis: Size of Minimum Deposit (dollars).)

When (3.28) is squared and summed over all observations, we obtain (3.27) since the
cross-product sum equals zero:

$$\sum_j \sum_i (Y_{ij} - \hat{Y}_{ij})^2 = \sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2 + \sum_j \sum_i (\bar{Y}_j - \hat{Y}_{ij})^2 \tag{3.29}$$

$$SSE = SSPE + SSLF$$

Note from (3.29) that we can define the lack of fit sum of squares directly as follows:

$$SSLF = \sum_j \sum_i (\bar{Y}_j - \hat{Y}_{ij})^2 \tag{3.30}$$

Since all $Y_{ij}$ observations at the level $X_j$ have the same fitted value, which we can denote
by $\hat{Y}_j$, we can express (3.30) equivalently as:

$$SSLF = \sum_j n_j (\bar{Y}_j - \hat{Y}_j)^2 \tag{3.30a}$$

Formula (3.30a) indicates clearly why SSLF measures lack of fit. If the linear regression
function is appropriate, then the means $\bar{Y}_j$ will be near the fitted values $\hat{Y}_j$ calculated from
the estimated linear regression function and SSLF will be small. On the other hand, if the
linear regression function is not appropriate, the means $\bar{Y}_j$ will not be near the fitted values
calculated from the estimated linear regression function, as in Figure 3.11 for the bank
example, and SSLF will be large.

Formula (3.30a) also indicates why c - 2 degrees of freedom are associated with SSLF.
There are c means $\bar{Y}_j$ in the sum of squares, and two degrees of freedom are lost in estimating
the parameters $\beta_0$ and $\beta_1$ of the linear regression function to obtain the fitted values $\hat{Y}_j$.
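A small numerical check can make the equivalence of the two routes to SSLF concrete: the direct formula (3.30a) and the difference SSE - SSPE from (3.24). The sketch below assumes the Table 3.5 data; the variable names are illustrative only.

```python
from collections import defaultdict

# Bank example data as arranged in Table 3.5.
X = [75, 75, 100, 100, 125, 125, 150, 175, 175, 200, 200]
Y = [28, 42, 112, 136, 160, 150, 152, 156, 124, 124, 104]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
      / sum((x - x_bar) ** 2 for x in X))
b0 = y_bar - b1 * x_bar

groups = defaultdict(list)
for x, y in zip(X, Y):
    groups[x].append(y)

# Direct definition (3.30a): weighted squared gaps between the level
# means and the fitted values at each X level.
sslf_direct = sum(len(ys) * (sum(ys) / len(ys) - (b0 + b1 * x)) ** 2
                  for x, ys in groups.items())

# Indirect route (3.24): SSE minus SSPE.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
sspe = sum((y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys)
sslf_by_difference = sse - sspe

print(round(sslf_direct, 1), round(sslf_by_difference, 1))
```

Agreement of the two values, up to floating-point rounding, reflects the vanishing cross-product term in (3.29).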

An ANOVA table can be constructed for the decomposition of SSE. Table 3.6a contains 
the general ANOVA table, including the decomposition of SSE just explained and the 
mean squares of interest, and Table 3.6b contains the ANOVA decomposition for the bank 
example. 




TABLE 3.6 General ANOVA Table for Testing Lack of Fit of Simple Linear Regression Function and ANOVA Table: Bank Example.

(a) General

Source of
Variation     SS                                                      df       MS
Regression    $SSR = \sum\sum (\hat{Y}_{ij} - \bar{Y})^2$             1        MSR = SSR/1
Error         $SSE = \sum\sum (Y_{ij} - \hat{Y}_{ij})^2$              n - 2    MSE = SSE/(n - 2)
Lack of fit   $SSLF = \sum\sum (\bar{Y}_j - \hat{Y}_{ij})^2$          c - 2    MSLF = SSLF/(c - 2)
Pure error    $SSPE = \sum\sum (Y_{ij} - \bar{Y}_j)^2$                n - c    MSPE = SSPE/(n - c)
Total         $SSTO = \sum\sum (Y_{ij} - \bar{Y})^2$                  n - 1

(b) Bank Example

Source of
Variation     SS           df     MS
Regression     5,141.3      1     5,141.3
Error         14,741.6      9     1,638.0
Lack of fit   13,593.6      4     3,398.4
Pure error     1,148.0      5       229.6
Total         19,882.9     10



Comments 

1. As shown by the bank example, not all levels of X need have repeat observations for the F test 
for lack of fit to be applicable. Repeat observations at only one or some levels of X are sufficient. 

2. It can be shown that the mean squares MSPE and MSLF have the following expectations when
testing whether the regression function is linear:

$$E\{MSPE\} = \sigma^2 \tag{3.31}$$

$$E\{MSLF\} = \sigma^2 + \frac{\sum_j n_j [\mu_j - (\beta_0 + \beta_1 X_j)]^2}{c - 2} \tag{3.32}$$

The reason for the term “pure error” is that MSPE is always an unbiased estimator of the error term
variance $\sigma^2$, no matter what is the true regression function. The expected value of MSLF also is $\sigma^2$ if
the regression function is linear, because $\mu_j = \beta_0 + \beta_1 X_j$ then and the second term in (3.32) becomes
zero. On the other hand, if the regression function is not linear, $\mu_j \ne \beta_0 + \beta_1 X_j$ and E{MSLF} will
be greater than $\sigma^2$. Hence, a value of F* near 1 accords with a linear regression function; large values
of F* indicate that the regression function is not linear.

3. The terminology “error sum of squares” and “error mean square” is not precise when the 
regression function under test in Ho is not the true function since the error sum of squares and error 
mean square then reflect the effects of both the lack of fit and the variability of the error terms. We 
continue to use the terminology for consistency and now use the term “pure error” to identify the 
variability associated with the error term only. 





4. Suppose that prior to any analysis of the appropriateness of the model, we had fitted a linear
regression model and wished to test whether or not $\beta_1 = 0$ for the bank example (Table 3.4b). Test
statistic (2.60) would be:

$$F^* = \frac{MSR}{MSE} = \frac{5{,}141.3}{1{,}638.0} = 3.14$$

For $\alpha = .10$, F(.90; 1, 9) = 3.36, and we would conclude $H_0$, that $\beta_1 = 0$ or that there is no linear
association between minimum deposit size (and value of gift) and number of new accounts. A conclusion
that there is no relation between these variables would be improper, however. Such an inference
requires that regression model (2.1) be appropriate. Here, there is a definite relationship, but the
regression function is not linear. This illustrates the importance of always examining the appropriateness
of a model before any inferences are drawn.
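The contrast drawn in this comment can be reproduced numerically. The sketch below, which assumes the Table 3.5 data, computes the usual F statistic for testing whether the slope is zero and compares it with the critical value 3.36 quoted in the text; variable names are illustrative.

```python
# Bank example data as arranged in Table 3.5.
X = [75, 75, 100, 100, 125, 125, 150, 175, 175, 200, 200]
Y = [28, 42, 112, 136, 160, 150, 152, 156, 124, 124, 104]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

ssto = sum((y - y_bar) ** 2 for y in Y)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
ssr = ssto - sse

msr = ssr / 1
mse = sse / (n - 2)
f_slope = msr / mse

F_CRIT_90 = 3.36  # F(.90; 1, 9), taken from the text
slope_conclusion = "Ha" if f_slope > F_CRIT_90 else "H0"
print(round(f_slope, 2), slope_conclusion)
```

The slope test alone fails to reject $H_0$ here even though the lack of fit test rejects linearity, which is the point of the comment.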

5. The general linear test approach just explained can be used to test the appropriateness of other
regression functions. Only the degrees of freedom for SSLF will need to be modified. In general, c - p
degrees of freedom are associated with SSLF, where p is the number of parameters in the regression
function. For the test of a simple linear regression function, p = 2 because there are two parameters,
$\beta_0$ and $\beta_1$, in the regression function.

6. The alternative H a in (3.19) includes all regression functions other than a linear one. For 
instance, it includes a quadratic regression function or a logarithmic one. If H a is concluded, a study 
of residuals can be helpful in identifying an appropriate function. 

7. When we conclude that the employed model in $H_0$ is appropriate, the usual practice is to use
the error mean square MSE as an estimator of $\sigma^2$ in preference to the pure error mean square MSPE,
since the former contains more degrees of freedom.

8. Observations at the same level of X are genuine repeats only if they involve independent trials
with respect to the error term. Suppose that in a regression analysis of the relation between hardness
(Y) and amount of carbon (X) in specimens of an alloy, the error term in the model covers, among
other things, random errors in the measurement of hardness by the analyst and effects of uncontrolled
production factors, which vary at random from specimen to specimen and affect hardness. If the
analyst takes two readings on the hardness of a specimen, this will not provide a genuine replication
because the effects of random variation in the production factors are fixed in any given specimen.
For genuine replications, different specimens with the same carbon content (X) would have to be
measured by the analyst so that all the effects covered in the error term could vary at random from
one repeated observation to the next.

9. When no replications are present in a data set, an approximate test for lack of fit can be
conducted if there are some cases at adjacent X levels for which the mean responses are quite close to
each other. Such adjacent cases are grouped together and treated as pseudoreplicates, and the test for
lack of fit is then carried out using these groupings of adjacent cases. A useful summary of this and
related procedures for conducting a test for lack of fit when no replicates are present may be found in
Reference 3.8.


Overview of Remedial Measures 


If the simple linear regression model (2.1) is not appropriate for a data set, there are two
basic choices:

1. Abandon regression model (2.1) and develop and use a more appropriate model. 

2. Employ some transformation on the data so that regression model (2.1) is appropriate 
for the transformed data. 





Each approach has advantages and disadvantages. The first approach may entail a more
complex model that could yield better insights, but may also lead to more complex procedures
for estimating the parameters. Successful use of transformations, on the other hand,
leads to relatively simple methods of estimation and may involve fewer parameters than
a complex model, an advantage when the sample size is small. Yet transformations may
obscure the fundamental interconnections between the variables, though at other times they
may illuminate them.

We consider the use of transformations in this chapter and the use of more complex 
models in later chapters. First, we provide a brief overview of remedial measures. 

Nonlinearity of Regression Function

When the regression function is not linear, a direct approach is to modify regression
model (2.1) by altering the nature of the regression function. For instance, a quadratic
regression function might be used:

$$E\{Y\} = \beta_0 + \beta_1 X + \beta_2 X^2$$

or an exponential regression function:

$$E\{Y\} = \beta_0 \beta_1^X$$

In Chapter 7, we discuss polynomial regression functions, and in Part III we take up nonlinear
regression functions, such as an exponential regression function.

The transformation approach employs a transformation to linearize, at least approximately,
a nonlinear regression function. We discuss the use of transformations to linearize
regression functions in Section 3.9.

When the nature of the regression function is not known, exploratory analysis that does 
not require specifying a particular type of function is often useful. We discuss exploratory 
regression analysis in Section 3.10. 

Nonconstancy of Error Variance 

When the error variance is not constant but varies in a systematic fashion, a direct approach
is to modify the model to allow for this and use the method of weighted least squares to
obtain the estimators of the parameters. We discuss the use of weighted least squares for
this purpose in Chapter 11.

Transformations can also be effective in stabilizing the variance. Some of these are 
discussed in Section 3.9. 

Nonindependence of Error Terms 

When the error terms are correlated, a direct remedial measure is to work with a model that 
calls for correlated error terms. We discuss such a model in Chapter 12. A simple remedial 
transformation that is often helpful is to work with first differences, a topic also discussed 
in Chapter 12. 

Nonnormality of Error Terms 

Lack of normality and nonconstant error variances frequently go hand in hand. Fortunately,
it is often the case that the same transformation that helps stabilize the variance is also helpful
in approximately normalizing the error terms. It is therefore desirable that the transformation
for stabilizing the error variance be utilized first, and then the residuals studied to see if
serious departures from normality are still present. We discuss transformations to achieve
approximate normality in Section 3.9.

Omission of Important Predictor Variables 

When residual analysis indicates that an important predictor variable has been omitted from 
the model, the solution is to modify the model. In Chapter 6 and later chapters, we discuss 
multiple regression analysis in which two or more predictor variables are utilized. 

Outlying Observations 

When outlying observations are present, as in Figure 3.7a, use of the least squares and
maximum likelihood estimators (1.10) for regression model (2.1) may lead to serious distortions
in the estimated regression function. When the outlying observations do not represent
recording errors and should not be discarded, it may be desirable to use an estimation
procedure that places less emphasis on such outlying observations. We discuss one such
robust estimation procedure in Chapter 11.

3.9 Transformations

We now consider in more detail the use of transformations of one or both of the original 
variables before carrying out the regression analysis. Simple transformations of either the 
response variable Y or the predictor variable X, or of both, are often sufficient to make the 
simple linear regression model appropriate for the transformed data. 

Transformations for Nonlinear Relation Only 

We first consider transformations for linearizing a nonlinear regression relation when the
distribution of the error terms is reasonably close to a normal distribution and the error
terms have approximately constant variance. In this situation, transformations on X should
be attempted. The reason why transformations on Y may not be desirable here is that a
transformation on Y, such as $Y' = \sqrt{Y}$, may materially change the shape of the distribution
of the error terms from the normal distribution and may also lead to substantially differing
error term variances.

Figure 3.13 contains some prototype nonlinear regression relations with constant error
variance and also presents some simple transformations on X that may be helpful to linearize
the regression relationship without affecting the distributions of Y. Several alternative
transformations may be tried. Scatter plots and residual plots based on each transformation
should then be prepared and analyzed, to decide which transformation is most effective.

Example

Data from an experiment on the effect of number of days of training received (X) on
performance (Y) in a battery of simulated sales situations are presented in Table 3.7,
columns 1 and 2, for the 10 participants in the study. A scatter plot of these data is shown in
Figure 3.14a. Clearly the regression relation appears to be curvilinear, so the simple linear
regression model (2.1) does not seem to be appropriate. Since the variability at the different
X levels appears to be fairly constant, we shall consider a transformation on X. Based on
the prototype plot in Figure 3.13a, we shall consider initially the square root transformation
$X' = \sqrt{X}$. The transformed values are shown in column 3 of Table 3.7.


TABLE 3.7 Use of Square Root Transformation of X to Linearize Regression Relation: Sales Training Example.

  Sales          (1)              (2)               (3)
 Trainee       Days of        Performance
    i        Training X_i      Score Y_i      X'_i = sqrt(X_i)
    1            .5              42.5             .70711
    2            .5              50.6             .70711
    3           1.0              68.5            1.00000
    4           1.0              80.7            1.00000
    5           1.5              89.0            1.22474
    6           1.5              99.6            1.22474
    7           2.0             105.3            1.41421
    8           2.0             111.8            1.41421
    9           2.5             112.3            1.58114
   10           2.5             125.7            1.58114

FIGURE 3.13 Prototype Nonlinear Regression Patterns with Constant Error Variance and Simple Transformations of X.

Prototype Regression Pattern     Transformations of X
(a)                              $X' = \log_{10} X$,  $X' = \sqrt{X}$
(b)                              $X' = X^2$,  $X' = \exp(X)$
(c)                              $X' = 1/X$,  $X' = \exp(-X)$

In Figure 3.14b, the same data are plotted with the predictor variable transformed to
$X' = \sqrt{X}$. Note that the scatter plot now shows a reasonably linear relation. The variability
of the scatter at the different X levels is the same as before, since we did not make a
transformation on Y.

To examine further whether the simple linear regression model (2.1) is appropriate now,
we fit it to the transformed X data. The regression calculations with the transformed X data
are carried out in the usual fashion, except that the predictor variable now is X'. We obtain
the following fitted regression function:

$$\hat{Y} = -10.33 + 83.45X'$$

FIGURE 3.14 Scatter Plots and Residual Plots: Sales Training Example. (a) Scatter Plot. (b) Scatter Plot against $\sqrt{X}$. (c) Residual Plot against $\sqrt{X}$. (d) Normal Probability Plot.

Figure 3.14c contains a plot of the residuals against X'. There is no evidence of lack of
fit or of strongly unequal error variances. Figure 3.14d contains a normal probability plot of
the residuals. No strong indications of substantial departures from normality are indicated
by this plot. This conclusion is supported by the high coefficient of correlation between the
ordered residuals and their expected values under normality, .979. For $\alpha = .01$, Table B.6
shows that the critical value is .879, so the observed coefficient is substantially larger
and supports the reasonableness of normal error terms. Thus, the simple linear regression
model (2.1) appears to be appropriate here for the transformed data.

The fitted regression function in the original units of X can easily be obtained, if desired:

$$\hat{Y} = -10.33 + 83.45\sqrt{X}$$
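The square root transformation in this example can be sketched as follows. The performance scores are those of Table 3.7, and the days-of-training values are the ones implied by column 3 of that table (the squares of the listed square roots); the variable names are illustrative assumptions.

```python
import math

# Sales training example (Table 3.7): days of training and performance score.
days = [0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0, 2.5, 2.5]
score = [42.5, 50.6, 68.5, 80.7, 89.0, 99.6, 105.3, 111.8, 112.3, 125.7]

# Transform the predictor only; the response is left untouched so the
# error distribution is not altered.
x_prime = [math.sqrt(x) for x in days]

n = len(x_prime)
x_bar = sum(x_prime) / n
y_bar = sum(score) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(x_prime, score))
      / sum((x - x_bar) ** 2 for x in x_prime))
b0 = y_bar - b1 * x_bar

# Residuals against X' would be plotted next to check the fit.
residuals = [y - (b0 + b1 * x) for x, y in zip(x_prime, score)]
print(f"Y-hat = {b0:.2f} + {b1:.2f} X'")
```

Because least squares residuals sum to zero when an intercept is fitted, `sum(residuals)` offers a simple sanity check on the arithmetic.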



FIGURE 3.15 Prototype Regression Patterns with ...

Prototype Regression Pattern     Transformations on Y
                                 $Y' = \sqrt{Y}$
                                 $Y' = \log_{10} Y$
                                 $Y' = 1/Y$

Note: A simultaneous transformation on X may also be helpful or necessary.

Comment 

At times, it may be helpful to introduce a constant into the transformation. For example, if some of
the X data are near zero and the reciprocal transformation is desired, we can shift the origin by using
the transformation X' = 1/(X + k), where k is an appropriately chosen constant.

Transformations for Nonnormality and Unequal Error Variances 

Unequal error variances and nonnormality of the error terms frequently appear together.
To remedy these departures from the simple linear regression model (2.1), we need a
transformation on Y, since the shapes and spreads of the distributions of Y need to be
changed. Such a transformation on Y may also at the same time help to linearize a curvilinear
regression relation. At other times, a simultaneous transformation on X may be needed to
obtain or maintain a linear regression relation.

Frequently, the nonnormality and unequal variances departures from regression
model (2.1) take the form of increasing skewness and increasing variability of the distributions
of the error terms as the mean response E{Y} increases. For example, in a regression
of yearly household expenditures for vacations (Y) on household income (X), there will
tend to be more variation and greater positive skewness (i.e., some very high yearly vacation
expenditures) for high-income households than for low-income households, who tend to
consistently spend much less for vacations. Figure 3.15 contains some prototype regression
relations where the skewness and the error variance increase with the mean response E{Y}.
This figure also presents some simple transformations on Y that may be helpful for these
cases. Several alternative transformations on Y may be tried, as well as some simultaneous
transformations on X. Scatter plots and residual plots should be prepared to determine the
most effective transformation(s).

Example

Data on age (X) and plasma level of a polyamine (Y) for a portion of the 25 healthy
children in a study are presented in columns 1 and 2 of Table 3.8. These data are plotted in
Figure 3.16a as a scatter plot. Note the distinct curvilinear regression relationship, as well
as the greater variability for younger children than for older ones.


TABLE 3.8 Use of Logarithmic Transformation of Y to Linearize Regression Relation and Stabilize Error Variance: Plasma Levels Example.

               (1)              (2)               (3)
 Child         Age          Plasma Level
   i           X_i              Y_i         Y'_i = log10 Y_i
   1       0 (newborn)         13.44            1.1284
   2       0 (newborn)         12.84            1.1086
   3       0 (newborn)         11.91            1.0759
   4       0 (newborn)         20.09            1.3030
   5       0 (newborn)         15.60            1.1931
   6          1.0              10.11            1.0048
   7          1.0              11.38            1.0561
  ...         ...               ...               ...
  19          3.0               6.90             .8388
  20          3.0               6.77             .8306
  21          4.0               4.86             .6866
  22          4.0               5.10             .7076
  23          4.0               5.67             .7536
  24          4.0               5.75             .7597
  25          4.0               6.23             .7945


On the basis of the prototype regression pattern in Figure 3.15b, we shall first try the
logarithmic transformation $Y' = \log_{10} Y$. The transformed Y values are shown in column 3
of Table 3.8. Figure 3.16b contains the scatter plot with this transformation. Note that the
transformation not only has led to a reasonably linear regression relation, but the variability
at the different levels of X also has become reasonably constant.

To further examine the reasonableness of the transformation $Y' = \log_{10} Y$, we fitted the
simple linear regression model (2.1) to the transformed Y data and obtained:

$$\hat{Y}' = 1.135 - .1023X$$

A plot of the residuals against X is shown in Figure 3.16c, and a normal probability plot of
the residuals is shown in Figure 3.16d. The coefficient of correlation between the ordered
residuals and their expected values under normality is .981. For $\alpha = .05$, Table B.6 indicates
that the critical value is .959 so that the observed coefficient supports the assumption of
normality of the error terms. All of this evidence supports the appropriateness of regression
model (2.1) for the transformed Y data.
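The transformation step itself is easy to sketch. The block below applies $Y' = \log_{10} Y$ to a few of the plasma levels listed in Table 3.8 and rounds to four decimals, as in column 3 of that table. Only a portion of the 25 cases is printed in the table, so this sketch does not attempt to reproduce the fitted regression function; the list name `plasma` is an illustrative choice.

```python
import math

# A few plasma levels (column 2 of Table 3.8); the table lists only a
# portion of the 25 children, so this is not the full data set.
plasma = [13.44, 12.84, 11.91, 20.09, 5.10, 5.67]

# Logarithmic transformation of the response, rounded as in column 3.
y_prime = [round(math.log10(y), 4) for y in plasma]

for y, yp in zip(plasma, y_prime):
    print(f"{y:6.2f} -> {yp:.4f}")
```

With the complete data in hand, the transformed values would simply replace Y in the usual least squares calculations.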

Comments 

1. At times it may be desirable to introduce a constant into a transformation of Y, such as when
Y may be negative. For instance, the logarithmic transformation to shift the origin in Y and make all
Y observations positive would be $Y' = \log_{10}(Y + k)$, where k is an appropriately chosen constant.

2. When unequal error variances are present but the regression relation is linear, a transformation
on Y may not be sufficient. While such a transformation may stabilize the error variance, it will also
change the linear relationship to a curvilinear one. A transformation on X may therefore also be
required. This case can also be handled by using weighted least squares, a procedure explained in
Chapter 11.



124 Part One Simple Linear Regression 


The difference between the two error sums of squares is called the lack of fit sum of squares 
here and is denoted by SSLF: 


SSLF = SSE — SSPE 

We can then express the test statistic as follows: 

* SSLF SSPE 

F* = ——- 4 -- 

c ~2 n — c 

_ MSLF 
_ MSPE 

where MSLF denotes the lack of fit mean square and MSPE denotes the pu 谈 error mean 
square. 

We know that large values of F* lead to conclusion H a in the general linear test. Decision 
rule (2.71) here becomes: 


(3.24) 


(3.25) 


If F* < F(1 — or; c — 2, n — c)，^pnclude Ho 
F* > F(1 — a;c — 2,« — c), conclude H a 


(3.26) 


For the bank example, the test statistic can be constructed easily from our earlier results: 


SSPE — 1,148.0 — c = ll 一 6 = 5 

SSE= 14,741.6 

SSLF = 14,741.6 — 1,148.0 = 13,593.6 c — 2 = 6 — 2 = 4 

,13,593.6 1,148.0 

F* = -+ —-— 

4 5 


3,398.4 
= 229.6 


= 14.80 


If the level of significance is to be or = .01, we require F(.99;4, 5) = 11.4. Since 
F*= 14.80 > 11.4, we conclude H a , that the regression function is not linear. This, of 
course, accords with our visual impression from Figure 3.11. The P-value for the test is 
.006. 


ANOVA Table 

The definition of the lack of fit sum of squares SSLF in (3.24) indicates that we have, in 
fact, decomposed the error sum of squares SSE into two components: 

SSE = SSPE + SSLF (3.27) 


This decomposition follows from the identity: 


Yij — Yij = Yij — Yj + Yj — Yij 



Error Pure error Lack of fit 

deviation deviation deviation 


(3.28) 


This identity shows that the error deviations in SSE are made up of a pure error component 
and a lack of fit component. Figure 3.12 illustrates this partitioning for the case = 160, 
X 3 = 125 in the bank example. 



Chapter 3 Diagnostics and Remedial Measures 125 



75 1 00 1 25 1 50 X ^ 

Size of Minimum Deposit (dollars) 


FIGURE 3.12 
Illustration of 
Decomposition 
of Error 
Deviation 

Yij — Yij — 

Bank 

Example. 


When (3.28) is squared and summed over all observations, we obtain (3.27) since the 
cross-product sum equals zero: 

E E( W = E E ⑺厂 ? /) 2 + E Ed &) 2 (329) 

SSE = SSPE + SSLF • 

Note from (3.29) that we can define the lack of fit sum of squares directly as follows: 

SSLF = (330) 

Since all Yjj observations at the level Xj have the same fitted value, which we can denote 
by Yj, we can express (3.30) equivalently as: 

SSLF= - t) 2 (3.30a) 

j 

Formula (3.30a) indicates clearly why SSLF measures lack of fit. If the linear regression 
function is appropriate, then the means Yj will be near the fitted values Yj calculated from 
the estimated linear regression function and SSLF will be small. On the other hand, if the 
linear regression function is not appropriate, the means Yj will not be near the fitted values 
calculated from the estimated linear ^regression function, as in Figure 3.11 for the bank 
example, and SSLF will be large. 

Formula (3.30a) also indicates why c — 2 degrees of freedom are associated with SSLF. 
There are c means Yj in the sum of squares, and two degrees of freedom are lost in estimating 
the parameters and of the linear regression function to obtain the fitted values Yj. 

An ANOVA table can be constructed for the decomposition of SSE. Table 3.6a contains 
the general ANOVA table, including the decomposition of SSE just explained and the 
mean squares of interest, and Table 3.6b contains the ANOVA decomposition for the bank 
example. 




o o o 

6 3 0 
1 1 1 

-unoyv Maz p J^qiz 



126 Part One Simple Linear Regression 


TABLE 3.6 General ANOVA Table for Testing Lack of Fit of Simple Linear Regression Function and ANOVA Table: Bank Example.

(a) General

Source of Variation    SS                                                    df       MS

Regression             $SSR = \sum\sum (\hat{Y}_{ij} - \bar{Y})^2$           1        MSR = SSR / 1
Error                  $SSE = \sum\sum (Y_{ij} - \hat{Y}_{ij})^2$            n - 2    MSE = SSE / (n - 2)
  Lack of fit          $SSLF = \sum\sum (\bar{Y}_j - \hat{Y}_{ij})^2$        c - 2    MSLF = SSLF / (c - 2)
  Pure error           $SSPE = \sum\sum (Y_{ij} - \bar{Y}_j)^2$              n - c    MSPE = SSPE / (n - c)
Total                  $SSTO = \sum\sum (Y_{ij} - \bar{Y})^2$                n - 1

(b) Bank Example

Source of Variation    SS          df    MS

Regression              5,141.3     1    5,141.3
Error                  14,741.6     9    1,638.0
  Lack of fit          13,593.6     4    3,398.4
  Pure error            1,148.0     5      229.6
Total                  19,882.9    10

Comments 


1. As shown by the bank example, not all levels of X need have repeat observations for the F test
for lack of fit to be applicable. Repeat observations at only one or some levels of X are sufficient.

2. It can be shown that the mean squares MSPE and MSLF have the following expectations when
testing whether the regression function is linear:

$$E\{MSPE\} = \sigma^2 \qquad (3.31)$$

$$E\{MSLF\} = \sigma^2 + \frac{\sum n_j [\mu_j - (\beta_0 + \beta_1 X_j)]^2}{c - 2} \qquad (3.32)$$

The reason for the term “pure error” is that MSPE is always an unbiased estimator of the error term
variance $\sigma^2$, no matter what is the true regression function. The expected value of MSLF also is $\sigma^2$ if
the regression function is linear, because $\mu_j = \beta_0 + \beta_1 X_j$ then and the second term in (3.32) becomes
zero. On the other hand, if the regression function is not linear, $\mu_j \neq \beta_0 + \beta_1 X_j$ and $E\{MSLF\}$ will
be greater than $\sigma^2$. Hence, a value of F* near 1 accords with a linear regression function; large values
of F* indicate that the regression function is not linear.

3. The terminology “error sum of squares” and “error mean square” is not precise when the
regression function under test in $H_0$ is not the true function since the error sum of squares and error
mean square then reflect the effects of both the lack of fit and the variability of the error terms. We
continue to use the terminology for consistency and now use the term “pure error” to identify the
variability associated with the error term only.

4. Suppose that prior to any analysis of the appropriateness of the model, we had fitted a linear
regression model and wished to test whether or not $\beta_1 = 0$ for the bank example (Table 3.4b). Test
statistic (2.60) would be:

$$F^* = \frac{MSR}{MSE} = \frac{5{,}141.3}{1{,}638.0} = 3.14$$

For $\alpha = .10$, F(.90; 1, 9) = 3.36, and we would conclude $H_0$, that $\beta_1 = 0$ or that there is no linear
association between minimum deposit size (and value of gift) and number of new accounts. A
conclusion that there is no relation between these variables would be improper, however. Such an inference
requires that regression model (2.1) be appropriate. Here, there is a definite relationship, but the
regression function is not linear. This illustrates the importance of always examining the appropriateness
of a model before any inferences are drawn.

5. The general linear test approach just explained can be used to test the appropriateness of other
regression functions. Only the degrees of freedom for SSLF will need to be modified. In general, c - p
degrees of freedom are associated with SSLF, where p is the number of parameters in the regression
function. For the test of a simple linear regression function, p = 2 because there are two parameters,
$\beta_0$ and $\beta_1$, in the regression function.

6. The alternative $H_a$ in (3.19) includes all regression functions other than a linear one. For
instance, it includes a quadratic regression function or a logarithmic one. If $H_a$ is concluded, a study
of residuals can be helpful in identifying an appropriate function.

7. When we conclude that the employed model in $H_0$ is appropriate, the usual practice is to use
the error mean square MSE as an estimator of $\sigma^2$ in preference to the pure error mean square MSPE,
since the former contains more degrees of freedom.

8. Observations at the same level of X are genuine repeats only if they involve independent trials
with respect to the error term. Suppose that in a regression analysis of the relation between hardness
(Y) and amount of carbon (X) in specimens of an alloy, the error term in the model covers, among
other things, random errors in the measurement of hardness by the analyst and effects of uncontrolled
production factors, which vary at random from specimen to specimen and affect hardness. If the
analyst takes two readings on the hardness of a specimen, this will not provide a genuine replication
because the effects of random variation in the production factors are fixed in any given specimen.
For genuine replications, different specimens with the same carbon content (X) would have to be
measured by the analyst so that all the effects covered in the error term could vary at random from
one repeated observation to the next.

9. When no replications are present in a data set, an approximate test for lack of fit can be
conducted if there are some cases at adjacent X levels for which the mean responses are quite close to
each other. Such adjacent cases are grouped together and treated as pseudoreplicates, and the test for
lack of fit is then carried out using these groupings of adjacent cases. A useful summary of this and
related procedures for conducting a test for lack of fit when no replicates are present may be found in
Reference 3.8.
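
The arithmetic behind Comment 4 can be reproduced directly from Table 3.6b. The short Python sketch below uses only the sums of squares and degrees of freedom from that table; the variable names are illustrative, and the lack of fit statistic is taken to be F* = MSLF/MSPE.

```python
# Sums of squares and degrees of freedom from Table 3.6b (bank example).
ssr, df_regression = 5141.3, 1
sse, df_error = 14741.6, 9
sslf, df_lack_of_fit = 13593.6, 4
sspe, df_pure_error = 1148.0, 5

# Mean squares are sums of squares divided by their degrees of freedom.
msr = ssr / df_regression
mse = sse / df_error
mslf = sslf / df_lack_of_fit
mspe = sspe / df_pure_error

f_slope = msr / mse          # test of beta_1 = 0, as in Comment 4
f_lack_of_fit = mslf / mspe  # test of linearity of the regression function

print(round(f_slope, 2), round(f_lack_of_fit, 2))
```

The slope statistic falls below F(.90; 1, 9) = 3.36 while the lack of fit statistic is far above 1, which is the contrast Comment 4 warns about.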

3.8 Overview of Remedial Measures

If the simple linear regression model (2.1) is not appropriate for a data set, there are two
basic choices:

1. Abandon regression model (2.1) and develop and use a more appropriate model.

2. Employ some transformation on the data so that regression model (2.1) is appropriate
for the transformed data.



Each approach has advantages and disadvantages. The first approach may entail a more
complex model that could yield better insights, but may also lead to more complex
procedures for estimating the parameters. Successful use of transformations, on the other hand,
leads to relatively simple methods of estimation and may involve fewer parameters than
a complex model, an advantage when the sample size is small. Yet transformations may
obscure the fundamental interconnections between the variables, though at other times they
may illuminate them.

We consider the use of transformations in this chapter and the use of more complex 
models in later chapters. First, we provide a brief overview of remedial measures. 

Nonlinearity of Regression Function 

When the regression function is not linear, a direct approach is to modify regression
model (2.1) by altering the nature of the regression function. For instance, a quadratic
regression function might be used:

$$E\{Y\} = \beta_0 + \beta_1 X + \beta_2 X^2$$

or an exponential regression function:

$$E\{Y\} = \beta_0 \beta_1^X$$

In Chapter 7, we discuss polynomial regression functions, and in Part III we take up nonlinear
regression functions, such as an exponential regression function.

The transformation approach employs a transformation to linearize, at least
approximately, a nonlinear regression function. We discuss the use of transformations to linearize
regression functions in Section 3.9.

When the nature of the regression function is not known, exploratory analysis that does
not require specifying a particular type of function is often useful. We discuss exploratory
regression analysis in Section 3.10.

Nonconstancy of Error Variance 

When the error variance is not constant but varies in a systematic fashion, a direct approach 
is to modify the model to allow for this and use the method of weighted least squares to 
obtain the estimators of the parameters. We discuss the use of weighted least squares for 
this purpose in Chapter 11. 

Transformations can also be effective in stabilizing the variance. Some of these are 
discussed in Section 3.9. 

Nonindependence of Error Terms 

When the error terms are correlated, a direct remedial measure is to work with a model that
calls for correlated error terms. We discuss such a model in Chapter 12. A simple remedial
transformation that is often helpful is to work with first differences, a topic also discussed
in Chapter 12.

Nonnormality of Error Terms 

Lack of normality and nonconstant error variances frequently go hand in hand. Fortunately,
it is often the case that the same transformation that helps stabilize the variance is also helpful
in approximately normalizing the error terms. It is therefore desirable that the transformation
for stabilizing the error variance be utilized first, and then the residuals studied to see if
serious departures from normality are still present. We discuss transformations to achieve
approximate normality in Section 3.9.

Omission of Important Predictor Variables 

When residual analysis indicates that an important predictor variable has been omitted from 
the model, the solution is to modify the model. In Chapter 6 and later chapters, we discuss 
multiple regression analysis in which two or more predictor variables are utilized. 

Outlying Observations 

When outlying observations are present, as in Figure 3.7a, use of the least squares and
maximum likelihood estimators (1.10) for regression model (2.1) may lead to serious
distortions in the estimated regression function. When the outlying observations do not
represent recording errors and should not be discarded, it may be desirable to use an estimation
procedure that places less emphasis on such outlying observations. We discuss one such
robust estimation procedure in Chapter 11.


3.9 Transformations

We now consider in more detail the use of transformations of one or both of the original 
variables before carrying out the regression analysis. Simple transformations of either the 
response variable Y or the predictor variable X, or of both, are often sufficient to make the 
simple linear regression model appropriate for the transformed data. 

Transformations for Nonlinear Relation Only 

We first consider transformations for linearizing a nonlinear regression relation when the
distribution of the error terms is reasonably close to a normal distribution and the error
terms have approximately constant variance. In this situation, transformations on X should
be attempted. The reason why transformations on Y may not be desirable here is that a
transformation on Y, such as $Y' = \sqrt{Y}$, may materially change the shape of the distribution
of the error terms from the normal distribution and may also lead to substantially differing
error term variances.

Figure 3.13 contains some prototype nonlinear regression relations with constant error
variance and also presents some simple transformations on X that may be helpful to
linearize the regression relationship without affecting the distributions of Y. Several alternative
transformations may be tried. Scatter plots and residual plots based on each transformation
should then be prepared and analyzed to decide which transformation is most effective.

Example

Data from an experiment on the effect of number of days of training received (X) on
performance (Y) in a battery of simulated sales situations are presented in Table 3.7,
columns 1 and 2, for the 10 participants in the study. A scatter plot of these data is shown in
Figure 3.14a. Clearly the regression relation appears to be curvilinear, so the simple linear
regression model (2.1) does not seem to be appropriate. Since the variability at the different
X levels appears to be fairly constant, we shall consider a transformation on X. Based on
the prototype plot in Figure 3.13a, we shall consider initially the square root transformation
$X' = \sqrt{X}$. The transformed values are shown in column 3 of Table 3.7.


TABLE 3.7 Use of Square Root Transformation of X to Linearize Regression Relation: Sales Training Example.

Sales        (1)           (2)            (3)
Trainee      Days of       Performance
             Training      Score
i            X_i           Y_i            X'_i = √X_i

 1            .5            42.5           .70711
 2            .5            50.6           .70711
 3           1.0            68.5          1.00000
 4           1.0            80.7          1.00000
 5           1.5            89.0          1.22474
 6           1.5            99.6          1.22474
 7           2.0           105.3          1.41421
 8           2.0           111.8          1.41421
 9           2.5           112.3          1.58114
10           2.5           125.7          1.58114


In Figure 3.14b, the same data are plotted with the predictor variable transformed to
$X' = \sqrt{X}$. Note that the scatter plot now shows a reasonably linear relation. The variability
of the scatter at the different X levels is the same as before, since we did not make a
transformation on Y.

To examine further whether the simple linear regression model (2.1) is appropriate now,
we fit it to the transformed X data. The regression calculations with the transformed X data
are carried out in the usual fashion, except that the predictor variable now is $X'$. We obtain
the following fitted regression function:

$$\hat{Y} = -10.33 + 83.45X'$$
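
The fit on the transformed predictor can be reproduced from columns 1 and 2 of Table 3.7. The following is a minimal Python sketch using the ordinary least squares formulas for a simple linear regression; the variable names are illustrative, and nothing beyond the table values is assumed.

```python
import math

# Columns 1 and 2 of Table 3.7 (sales training example).
days = [0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0, 2.5, 2.5]
score = [42.5, 50.6, 68.5, 80.7, 89.0, 99.6, 105.3, 111.8, 112.3, 125.7]

# Column 3: square root transformation of the predictor.
x_prime = [math.sqrt(d) for d in days]

n = len(days)
x_bar = sum(x_prime) / n
y_bar = sum(score) / n
sxx = sum((x - x_bar) ** 2 for x in x_prime)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_prime, score))

b1 = sxy / sxx            # slope with respect to X' = sqrt(X)
b0 = y_bar - b1 * x_bar   # intercept

# Residuals against X', the quantities plotted in a residual plot.
residuals = [y - (b0 + b1 * x) for x, y in zip(x_prime, score)]
print(round(b0, 2), round(b1, 2))
```

The printed coefficients should agree with the fitted function above to the two decimals reported.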

Figure 3.14c contains a plot of the residuals against $X'$. There is no evidence of lack of
fit or of strongly unequal error variances. Figure 3.14d contains a normal probability plot of
the residuals. No strong indications of substantial departures from normality are indicated
by this plot. This conclusion is supported by the high coefficient of correlation between the
ordered residuals and their expected values under normality, .979. For $\alpha = .01$, Table B.6
shows that the critical value is .879, so the observed coefficient is substantially larger
and supports the reasonableness of normal error terms. Thus, the simple linear regression
model (2.1) appears to be appropriate here for the transformed data.

The fitted regression function in the original units of X can easily be obtained, if desired:

$$\hat{Y} = -10.33 + 83.45\sqrt{X}$$

FIGURE 3.14 Scatter Plots and Residual Plots: Sales Training Example. (a) Scatter Plot (Performance against Days); (b) Scatter Plot against $\sqrt{X}$; (c) Residual Plot against $\sqrt{X}$; (d) Normal Probability Plot (Residual against Expected).



FIGURE 3.15 Prototype Regression Patterns with Unequal Error Variances and Simple Transformations of Y.

Transformations on Y: $Y' = \sqrt{Y}$, $Y' = \log_{10} Y$, $Y' = 1/Y$

Note: A simultaneous transformation on X may also be helpful or necessary.


Comment

At times, it may be helpful to introduce a constant into the transformation. For example, if some of
the X data are near zero and the reciprocal transformation is desired, we can shift the origin by using
the transformation $X' = 1/(X + k)$, where k is an appropriately chosen constant.


Transformations for Nonnormality and Unequal Error Variances 

Unequal error variances and nonnormality of the error terms frequently appear together.
To remedy these departures from the simple linear regression model (2.1), we need a
transformation on Y, since the shapes and spreads of the distributions of Y need to be
changed. Such a transformation on Y may also at the same time help to linearize a curvilinear
regression relation. At other times, a simultaneous transformation on X may be needed to
obtain or maintain a linear regression relation.

Frequently, the nonnormality and unequal variances departures from regression
model (2.1) take the form of increasing skewness and increasing variability of the
distributions of the error terms as the mean response $E\{Y\}$ increases. For example, in a regression
of yearly household expenditures for vacations (Y) on household income (X), there will
tend to be more variation and greater positive skewness (i.e., some very high yearly vacation
expenditures) for high-income households than for low-income households, who tend to
consistently spend much less for vacations. Figure 3.15 contains some prototype regression
relations where the skewness and the error variance increase with the mean response $E\{Y\}$.
This figure also presents some simple transformations on Y that may be helpful for these
cases. Several alternative transformations on Y may be tried, as well as some simultaneous
transformations on X. Scatter plots and residual plots should be prepared to determine the
most effective transformation(s).

Example

Data on age (X) and plasma level of a polyamine (Y) for a portion of the 25 healthy
children in a study are presented in columns 1 and 2 of Table 3.8. These data are plotted in
Figure 3.16a as a scatter plot. Note the distinct curvilinear regression relationship, as well
as the greater variability for younger children than for older ones.



TABLE 3.8 Use of Logarithmic Transformation of Y to Linearize Regression Relation and Stabilize Error Variance: Plasma Levels Example.

         (1)             (2)              (3)
Child    Age             Plasma Level
i        X_i             Y_i              Y'_i = log10 Y_i

 1       0 (newborn)     13.44            1.1284
 2       0 (newborn)     12.84            1.1086
 3       0 (newborn)     11.91            1.0759
 4       0 (newborn)     20.09            1.3030
 5       0 (newborn)     15.60            1.1931
 6       1.0             10.11            1.0048
 7       1.0             11.38            1.0561
...      ...             ...              ...
19       3.0              6.90             .8388
20       3.0              6.77             .8306
21       4.0              4.86             .6866
22       4.0              5.10             .7076
23       4.0              5.67             .7536
24       4.0              5.75             .7597
25       4.0              6.23             .7945


On the basis of the prototype regression pattern in Figure 3.15b, we shall first try the
logarithmic transformation $Y' = \log_{10} Y$. The transformed Y values are shown in column 3
of Table 3.8. Figure 3.16b contains the scatter plot with this transformation. Note that the
transformation not only has led to a reasonably linear regression relation, but the variability
at the different levels of X also has become reasonably constant.

To further examine the reasonableness of the transformation $Y' = \log_{10} Y$, we fitted the
simple linear regression model (2.1) to the transformed Y data and obtained:

$$\hat{Y}' = 1.135 - .1023X$$

A plot of the residuals against X is shown in Figure 3.16c, and a normal probability plot of
the residuals is shown in Figure 3.16d. The coefficient of correlation between the ordered
residuals and their expected values under normality is .981. For $\alpha = .05$, Table B.6 indicates
that the critical value is .959 so that the observed coefficient supports the assumption of
normality of the error terms. All of this evidence supports the appropriateness of regression
model (2.1) for the transformed Y data.
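
Because the model was fitted to $\log_{10} Y$, a fitted value must be raised as a power of 10 to express it in the original plasma level units. The Python sketch below does this with the coefficients reported above; the function name `predicted_plasma` is illustrative. The back-transformed values describe the fitted curve on the original scale and are not least squares fits to the untransformed Y.

```python
def predicted_plasma(age):
    """Back-transform the fitted value of log10(Y) to plasma level units."""
    log10_level = 1.135 - 0.1023 * age   # fitted function for Y' = log10(Y)
    return 10 ** log10_level


# Fitted curve on the original scale at the ages covered by Table 3.8.
for age in range(5):
    print(age, round(predicted_plasma(age), 2))
```

The printed values decrease with age and level off, matching the curvilinear pattern described for Figure 3.16a.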

Comments

1. At times it may be desirable to introduce a constant into a transformation of Y, such as when
Y may be negative. For instance, the logarithmic transformation to shift the origin in Y and make all
Y observations positive would be $Y' = \log_{10}(Y + k)$, where k is an appropriately chosen constant.

2. When unequal error variances are present but the regression relation is linear, a transformation
on Y may not be sufficient. While such a transformation may stabilize the error variance, it will also
change the linear relationship to a curvilinear one. A transformation on X may therefore also be
required. This case can also be handled by using weighted least squares, a procedure explained in
Chapter 11.



Box-Cox Transformations 

It is often difficult to determine from diagnostic plots, such as the one in Figure 3.16a for
the plasma levels example, which transformation of Y is most appropriate for correcting
skewness of the distributions of error terms, unequal error variances, and nonlinearity of the
regression function. The Box-Cox procedure (Ref. 3.9) automatically identifies a
transformation from the family of power transformations on Y. The family of power transformations
is of the form:

$$Y' = Y^{\lambda} \qquad (3.33)$$


where $\lambda$ is a parameter to be determined from the data. Note that this family encompasses
the following simple transformations:

$\lambda = 2$:       $Y' = Y^2$
$\lambda = .5$:      $Y' = \sqrt{Y}$
$\lambda = 0$:       $Y' = \log_e Y$  (by definition)          (3.34)
$\lambda = -.5$:     $Y' = 1/\sqrt{Y}$
$\lambda = -1.0$:    $Y' = 1/Y$

The normal error regression model with the response variable a member of the family of
power transformations in (3.33) becomes:

$$Y_i^{\lambda} = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (3.35)$$


Note that regression model (3.35) includes an additional parameter, $\lambda$, which needs to be
estimated. The Box-Cox procedure uses the method of maximum likelihood to estimate $\lambda$,
as well as the other parameters $\beta_0$, $\beta_1$, and $\sigma^2$. In this way, the Box-Cox procedure identifies
$\hat{\lambda}$, the maximum likelihood estimate of $\lambda$ to use in the power transformation.

Since some statistical software packages do not automatically provide the Box-Cox
maximum likelihood estimate $\hat{\lambda}$ for the power transformation, a simple procedure for obtaining
$\hat{\lambda}$ using standard regression software can be employed instead. This procedure involves a
numerical search in a range of potential $\lambda$ values; for example, $\lambda = -2$, $\lambda = -1.75$, ...,
$\lambda = 1.75$, $\lambda = 2$. For each $\lambda$ value, the $Y_i$ observations are first standardized so that the
magnitude of the error sum of squares does not depend on the value of $\lambda$:

$$W_i = \begin{cases} K_1 (Y_i^{\lambda} - 1) & \lambda \neq 0 \\ K_2 (\log_e Y_i) & \lambda = 0 \end{cases} \qquad (3.36)$$

where:

$$K_2 = \left( \prod_{i=1}^{n} Y_i \right)^{1/n} \qquad (3.36a)$$

$$K_1 = \frac{1}{\lambda K_2^{\lambda - 1}} \qquad (3.36b)$$

Note that $K_2$ is the geometric mean of the $Y_i$ observations.

Once the standardized observations $W_i$ have been obtained for a given $\lambda$ value, they are
regressed on the predictor variable X and the error sum of squares SSE is obtained. It can be
shown that the maximum likelihood estimate $\hat{\lambda}$ is that value of $\lambda$ for which SSE is a minimum.
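
The search procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function names are made up for this example, and the data are only the 14 cases listed in Table 3.8 rather than all 25 children, so the SSE values it produces will not match those reported for the full data set.

```python
import math


def ols_sse(x, w):
    """Error sum of squares from the least squares regression of w on x."""
    n = len(x)
    x_bar = sum(x) / n
    w_bar = sum(w) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxw = sum((xi - x_bar) * (wi - w_bar) for xi, wi in zip(x, w))
    b1 = sxw / sxx
    b0 = w_bar - b1 * x_bar
    return sum((wi - b0 - b1 * xi) ** 2 for xi, wi in zip(x, w))


def standardize(y, lam):
    """Standardized observations W_i of (3.36); requires all Y_i > 0."""
    n = len(y)
    k2 = math.exp(sum(math.log(yi) for yi in y) / n)   # geometric mean, (3.36a)
    if lam == 0:
        return [k2 * math.log(yi) for yi in y]
    k1 = 1.0 / (lam * k2 ** (lam - 1))                  # (3.36b)
    return [k1 * (yi ** lam - 1) for yi in y]


def box_cox_search(x, y, lambdas):
    """Return the lambda in the grid with the smallest SSE, and all SSE values."""
    sse_by_lambda = {lam: ols_sse(x, standardize(y, lam)) for lam in lambdas}
    best = min(sse_by_lambda, key=sse_by_lambda.get)
    return best, sse_by_lambda


# Only the 14 cases shown in Table 3.8 (children 1-7 and 19-25).
age = [0, 0, 0, 0, 0, 1, 1, 3, 3, 4, 4, 4, 4, 4]
plasma = [13.44, 12.84, 11.91, 20.09, 15.60, 10.11, 11.38,
          6.90, 6.77, 4.86, 5.10, 5.67, 5.75, 6.23]

grid = [round(-2 + 0.25 * i, 2) for i in range(17)]   # -2, -1.75, ..., 2
best_lambda, sse_table = box_cox_search(age, plasma, grid)
print(best_lambda)
```

Standardizing with $K_1$ and $K_2$ is what makes the SSE values comparable across $\lambda$; without it, the scale of the transformed response would change with each power.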

If desired, a finer search can be conducted in the neighborhood of the $\lambda$ value that
minimizes SSE. However, the Box-Cox procedure ordinarily is used only to provide a guide
for selecting a transformation, so overly precise results are not needed. In any case, scatter
and residual plots should be utilized to examine the appropriateness of the transformation
identified by the Box-Cox procedure.

TABLE 3.9 Box-Cox Results: Plasma Levels Example.

FIGURE 3.17 SAS-JMP Box-Cox Results: Plasma Levels Example.

Example


Table 3.9 contains the Box-Cox results for the plasma levels example. Selected values of $\lambda$,
ranging from -1.0 to 1.0, were chosen, and for each chosen $\lambda$ the transformation (3.36)
was made and the linear regression of W on X was fitted. For instance, for $\lambda = .5$, the
transformation $W_i = K_1(\sqrt{Y_i} - 1)$ was made and the linear regression of W on X was fitted.
For this fitted linear regression, the error sum of squares is SSE = 48.4. The transformation
that leads to the smallest value of SSE corresponds to $\lambda = -.5$, for which SSE = 30.6.

Figure 3.17 contains the SAS-JMP Box-Cox results for this example. It consists of a
plot of SSE as a function of $\lambda$. From the plot, it is clear that a power value near $\lambda = -.50$
is indicated. However, SSE as a function of $\lambda$ is fairly stable in the range from near 0 to
-1.0, so the earlier choice of the logarithmic transformation $Y' = \log_{10} Y$ for the plasma
levels example, corresponding to $\lambda = 0$, is not unreasonable according to the Box-Cox
approach. One reason the logarithmic transformation was chosen here is the
ease of interpreting it. The use of logarithms to base 10 rather than natural logarithms does
not, of course, affect the appropriateness of the logarithmic transformation.


Comments 

1. At times, theoretical or a priori considerations can be utilized to help in choosing an appropriate
transformation. For example, when the shape of the scatter in a study of the relation between price of a
commodity (X) and quantity demanded (Y) is that in Figure 3.15b, economists may prefer logarithmic
transformations of both Y and X because the slope of the regression line for the transformed variables
then measures the price elasticity of demand. The slope is then commonly interpreted as showing the
percent change in quantity demanded per 1 percent change in price, where it is understood that the
changes are in opposite directions.


   λ       SSE         λ       SSE

  1.0      78.0       -.1      33.1
   .9      70.4       -.3      31.2
   .7      57.8       -.4      30.7
   .5      48.4       -.5      30.6
   .3      44.4       -.6      30.7
   .1      36.4       -.7      31.1
   0       34.5       -.9      32.7
                     -1.0      33.9


Similarly, scientists may prefer logarithmic transformations of both Y and X when studying the
relation between radioactive decay (Y) of a substance and time (X) for a curvilinear relation of the
type illustrated in Figure 3.15b because the slope of the regression line for the transformed variables
then measures the decay rate.

2. After a transformation has been tentatively selected, residual plots and other analyses described 
earlier need to be employed to ascertain that the simple linear regression model (2.1) is appropriate 
for the transformed data. 

3. When transformed models are employed, the estimators $b_0$ and $b_1$ obtained by least squares
have the least squares properties with respect to the transformed observations, not the original ones.

4. The maximum likelihood estimate of $\lambda$ with the Box-Cox procedure is subject to sampling
variability. In addition, the error sum of squares SSE is often fairly stable in a neighborhood around the
estimate. It is therefore often reasonable to use a nearby $\lambda$ value for which the power transformation
is easy to understand. For example, use of $\lambda = 0$ instead of the maximum likelihood estimate
$\hat{\lambda} = .13$ or use of $\lambda = -.5$ instead of $\hat{\lambda} = -.79$ may facilitate understanding without sacrificing
much in terms of the effectiveness of the transformation. To determine the reasonableness of using
an easier-to-understand value of $\lambda$, one should examine the flatness of the likelihood function in
the neighborhood of $\hat{\lambda}$, as we did in the plasma levels example. Alternatively, one may construct an
approximate confidence interval for $\lambda$; the procedure for constructing such an interval is discussed in
Reference 3.10.

5. When the Box-Cox procedure leads to a $\lambda$ value near 1, no transformation of Y may be needed.


3.10 Exploration of Shape of Regression Function 


Scatter plots often indicate readily the nature of the regression function. For instance,
Figure 1.3 clearly shows the curvilinear nature of the regression relationship between steroid
level and age. At other times, however, the scatter plot is complex and it becomes difficult to 
see the nature of the regression relationship, if any, from the plot. In these cases, it is helpful 
to explore the nature of the regression relationship by fitting a smoothed curve without any 
constraints on the regression function. These smoothed curves are also called nonparametric 
regression curves. They are useful not only for exploring regression relationships but also 
for confirming the nature of the regression function when the scatter plot visually suggests 
the nature of the regression relationship. 

Many smoothing methods have been developed for obtaining smoothed curves for time
series data, where the $X_i$ denote time periods that are equally spaced apart. The method of
moving averages uses the mean of the Y observations for adjacent time periods to obtain
smoothed values. For example, the mean of the Y values for the first three time periods
in the time series might constitute the first smoothed value corresponding to the middle
of the three time periods, in other words, corresponding to time period 2. Then the mean
of the Y values for the second, third, and fourth time periods would constitute the second
smoothed value, corresponding to the middle of these three time periods, in other words,
corresponding to time period 3, and so on. Special procedures are required for obtaining
smoothed values at the two ends of the time series. The larger the successive neighborhoods
used for obtaining the smoothed values, the smoother the curve will be.

The method of running medians is similar to the method of moving averages, except
that the median is used as the average measure in order to reduce the influence of outlying
observations. With this method, as well as with the moving average method, successive
smoothing of the smoothed values and other refinements may be undertaken to provide a
suitable smoothed curve for the time series. Reference 3.11 provides a good introduction
to the running median smoothing method.
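
The two methods differ only in the average measure applied to each neighborhood, which a short Python sketch can make concrete. The series below is hypothetical, with one deliberately outlying value, and the function name `running_smooth` is illustrative. An odd window size is assumed, and the end points are left unsmoothed (as `None`) because the special end procedures are not specified here.

```python
from statistics import mean, median


def running_smooth(values, window=3, average=mean):
    """Smooth interior points with the given average over a centered window.

    Assumes an odd window size. End points, where a full window is not
    available, are returned as None.
    """
    half = window // 2
    smoothed = [None] * len(values)
    for t in range(half, len(values) - half):
        smoothed[t] = average(values[t - half:t + half + 1])
    return smoothed


# Hypothetical equally spaced time series with an outlying value of 30.
series = [10.0, 12.0, 11.0, 30.0, 13.0, 14.0, 15.0]

moving_avg = running_smooth(series, window=3, average=mean)
running_med = running_smooth(series, window=3, average=median)
print(moving_avg)
print(running_med)
```

In the moving average output the outlying value pulls up every window that contains it, while the running medians are much less affected.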

Many smoothing methods have also been developed for regression data when the X
values are not equally spaced apart. A simple smoothing method, band regression, divides
the data set into a number of groups or “bands” consisting of adjacent cases according to
their X levels. For each band, the median X value and the median Y value are calculated,
and the points defined by the pairs of these median values are then connected by straight
lines. For example, consider the following simple data set divided into three groups:

  X        Y        Median X    Median Y

 2.0      13.1        2.7         14.4
 3.4      15.7

 3.7      14.9
 4.5      16.8        4.5         16.8
 5.0      17.1

 5.2      16.9        5.55        17.35
 5.9      17.8


The three pairs of medians are then plotted on the scatter plot of the data and connected by 
straight lines as a simple smoothed nonparametric regression curve. 
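
The band medians for this small data set can be computed as follows. This is a minimal Python sketch; the function name `band_regression` and the `band_sizes` argument (the number of adjacent cases in each band) are illustrative choices, since how the bands are formed is left to the analyst.

```python
from statistics import median


def band_regression(x, y, band_sizes):
    """Return the (median X, median Y) point for each band of adjacent cases."""
    pairs = sorted(zip(x, y))            # order cases by their X levels
    points = []
    start = 0
    for size in band_sizes:
        band = pairs[start:start + size]
        band_x = [p[0] for p in band]
        band_y = [p[1] for p in band]
        points.append((median(band_x), median(band_y)))
        start += size
    return points


# The simple data set from the text, divided into bands of 2, 3, and 2 cases.
x = [2.0, 3.4, 3.7, 4.5, 5.0, 5.2, 5.9]
y = [13.1, 15.7, 14.9, 16.8, 17.1, 16.9, 17.8]

median_points = band_regression(x, y, band_sizes=[2, 3, 2])
print(median_points)
```

Connecting the three returned points by straight lines gives the smoothed curve described above.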

Lowess Method 

The lowess method, developed by Cleveland (Ref. 3.12), is a more refined nonparametric
method than band regression. It obtains a smoothed curve by fitting successive linear
regression functions in local neighborhoods. The name lowess stands for locally weighted
regression scatter plot smoothing. The method is similar to the moving average and running
median methods in that it uses a neighborhood around each X value to obtain a smoothed
Y value corresponding to that X value. It obtains the smoothed Y value at a given X by
fitting a linear regression to the data in the neighborhood of the X value and then using the
fitted value at X as the smoothed value. To illustrate this concretely, let $(X_1, Y_1)$ denote the
sample case with the smallest X value, $(X_2, Y_2)$ denote the sample case with the second
smallest X value, and so on. If neighborhoods of three X values are used with the lowess
method, then a linear regression would be fitted to the data:

$(X_1, Y_1)$    $(X_2, Y_2)$    $(X_3, Y_3)$

The fitted value at $X_2$ would constitute the smoothed value corresponding to $X_2$. Another
linear regression would be fitted to the data:

$(X_2, Y_2)$    $(X_3, Y_3)$    $(X_4, Y_4)$

and the fitted value at $X_3$ would constitute the smoothed value corresponding to $X_3$.
Smoothed values at each end of the X range are also obtained by the lowess procedure.

The lowess method uses a number of refinements in obtaining the final smoothed values 
to improve the smoothing and to make the procedure robust to outlying observations. 

1. The linear regression is weighted to give cases further from the middle X level in each 
neighborhood smaller weights. 

2. To make the procedure robust to outlying observations, the linear regression fitting is 
repeated, with the weights revised so that cases that had large residuals in the first fitting 
receive smaller weights in the second fitting. 

3. To improve the robustness of the procedure further, step 2 is repeated one or more 
times by revising the weights according to the size of the residuals in the latest fitting. 

To implement the lowess procedure, one must choose the size of the successive
neighborhoods to be used when fitting each linear regression. One must also choose the weight
function that gives less weight to neighborhood cases with X values far from each center
X level and another weight function that gives less weight to cases with large residuals.
Finally, the number of iterations to make the procedure robust must be chosen.

In practice, two iterations appear to be sufficient to provide robustness. Also, the weight 
functions suggested by Cleveland appear to be adequate for many circumstances. Hence, 
the primary choice to be made for a particular application is the size of the successive 
neighborhoods. The larger the size, the smoother the function but the greater the danger 
that the smoothing will lose essential features of the regression relationship. It may require 
some experimentation with different neighborhood sizes in order to find the size that best 
brings out the regression relationship. We explain the lowess method in detail in Chapter 11 
in the context of multiple regression. Specific choices of weight functions and neighborhood 
sizes are discussed there. 
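
The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by any statistical package. It assumes the tricube function for the distance weights and the bisquare function (scaled by six times the median absolute residual) for the robustness weights, which are the weight functions proposed in Cleveland's paper (Reference 3.12). The function names, the neighborhood fraction `q`, and the demonstration data are hypothetical.

```python
import math
from statistics import median

def tricube(u):
    """Distance weight: 1 at the center of the neighborhood, 0 at its edge."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def bisquare(u):
    """Robustness weight: shrinks toward 0 as the scaled residual grows."""
    u = abs(u)
    return (1.0 - u ** 2) ** 2 if u < 1.0 else 0.0

def weighted_line_fit(x, y, w, x0):
    """Fitted value at x0 from a weighted least squares straight line."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)

def lowess(x, y, q=0.5, robust_iters=2):
    """Smoothed values at each x; q is the fraction of cases per neighborhood."""
    n = len(x)
    k = min(n, max(2, round(q * n)))      # neighborhood size
    robust_w = [1.0] * n                  # all cases start with full weight
    fitted = [0.0] * n
    for _ in range(robust_iters + 1):     # first fit plus the robust refits
        for i in range(n):
            dist = [abs(xj - x[i]) for xj in x]
            h = sorted(dist)[k - 1]       # distance to the k-th nearest case
            if h > 0:
                base = [tricube(d / h) for d in dist]
            else:
                base = [1.0 if d == 0 else 0.0 for d in dist]
            w = [b * r for b, r in zip(base, robust_w)]
            if sum(w) <= 0:               # fall back to distance weights only
                w = base
            fitted[i] = weighted_line_fit(x, y, w, x[i])
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        s = median([abs(e) for e in resid])
        if s == 0:
            break
        robust_w = [bisquare(e / (6.0 * s)) for e in resid]
    return fitted

# Hypothetical demonstration data: a straight line with mild wiggle and one outlier.
x_demo = [float(i) for i in range(40)]
y_demo = [2.0 + 3.0 * xi + 0.5 * math.sin(1.7 * xi) for xi in x_demo]
y_demo[20] += 50.0
plain_fit = lowess(x_demo, y_demo, q=0.5, robust_iters=0)
robust_fit = lowess(x_demo, y_demo, q=0.5, robust_iters=2)
print(plain_fit[20], robust_fit[20])
```

With `robust_iters=0` the outlying case pulls the smoothed value toward itself; with two robustness iterations it receives essentially no weight, which mirrors the behavior described for the refinements above.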

Figure 3.18a contains a scatter plot based on a study of research quality at 24 research 
laboratories. The response variable is a measure of the quality of the research done at the 
laboratory, and the explanatory variable is a measure of the volume of research performed 
at the laboratory. Note that it is very difficult to tell from this scatter plot whether or not a 
relationship exists between research quality and quantity. Figure 3.18b repeats the scatter 
plot and also shows the lowess smoothed curve. The curve suggests that there might be 
somewhat higher research quality for medium-sized laboratories. However, the scatter is 
great so that this suggested relationship should be considered only as a possibility. Also, 
because any particular measures of research quality and quantity are so limited, other 
measures should be considered to see if these corroborate the relationship suggested in 
Figure 3.18b.

Use of Smoothed Curves to Confirm Fitted Regression Function 

Smoothed curves are useful not only in the exploratory stages when a regression model is
selected but they are also helpful in confirming the regression function chosen. The
procedure for confirmation is simple: The smoothed curve is plotted together with the confidence
band for the fitted regression function. If the smoothed curve falls within the confidence
band, we have supporting evidence of the appropriateness of the fitted regression function.


Example 



140 Part One Simple Linear Regression 




Figure 3.19a repeats the scatter plot for the Toluca Company example from Figure 1.10a 
and shows the lowess smoothed curve. It appears that the regression relation is linear or 
possibly slightly curved. Figure 3.19b repeats the confidence band for the regression line 
from Figure 2.6 and shows the lowess smoothed curve. We see that the smoothed curve falls 
within the confidence band for the regression line and thereby supports the appropriateness 
of a linear regression function. 

Comments 

1. Smoothed curves, such as the lowess curve, do not provide an analytical expression for the 
functional form of the regression relationship. They only suggest the shape of the regression curve. 

2. The lowess procedure is not restricted to fitting linear regression functions in each neighborhood.
Higher-degree polynomials can also be utilized with this method.


FIGURE 3.18 MINITAB Scatter Plot and Lowess Smoothed Curve: Research Laboratories Example.

FIGURE 3.19 MINITAB Lowess Curve and Confidence Band for Regression Line: Toluca Company Example.


3. Smoothed curves are also useful when examining residual plots to ascertain whether the
residuals (or the absolute or squared residuals) follow some relationship with X or Y.

4. References 3.13 and 3.14 provide good introductions to other nonparametric methods in
regression analysis.


3.11 Case Example: Plutonium Measurement

Some environmental cleanup work requires that nuclear materials, such as plutonium 238,
be located and completely removed from a restoration site. When plutonium has become
mixed with other materials in very small amounts, detecting its presence can be a difficult
task. Even very small amounts can be traced, however, because plutonium emits subatomic
particles (alpha particles) that can be detected. Devices that are used to detect plutonium
record the intensity of alpha particle strikes in counts per second (#/sec). The regression
relationship between alpha counts per second (the response variable) and plutonium activity (the
explanatory variable) is then used to estimate the activity of plutonium in the material under
study. This use of a regression relationship involves inverse prediction [i.e., predicting
plutonium activity (X) from the observed alpha count (Y)], a procedure discussed in Chapter 4.

The task here is to estimate the regression relationship between alpha counts per second
and plutonium activity. This relationship varies for each measurement device and must be 
established precisely each time a different measurement device is used. It is reasonable to 
assume here that the level of alpha counts increases with plutonium activity, but the exact 
nature of the relationship is generally unknown. 

In a study to establish the regression relationship for a particular measurement device,
four plutonium standards were used. These standards are aluminum/plutonium rods
containing a fixed, known level of plutonium activity. The levels of plutonium activity in the
four standards were 0.0, 5.0, 10.0, and 20.0 picocuries per gram (pCi/g). Each standard was
exposed to the detection device from 4 to 10 times, and the rate of alpha strikes, measured 
as counts per second, was observed for each replication. A portion of the data is shown 
in Table 3.10, and the data are plotted as a scatter plot in Figure 3.20a. Notice that, as 
expected, the strike rate tends to increase with the activity level of plutonium. Notice also 
that nonzero strike rates are recorded for the standard containing no plutonium. This results 
from background radiation and indicates that a regression model with an intercept term is 
required here. 


TABLE 3.10 Data: Plutonium Measurement Example.

Case    Plutonium Activity (pCi/g)    Alpha Count Rate (#/sec)
  1                20                          .150
  2                 0                          .004
  3                10                       [illegible]
 ...              ...                          ...
 22                 0                       [illegible]
 23                 5                          .049
 24                 0                          .106


FIGURE 3.20 SAS-JMP Scatter Plot and Lowess Smoothed Curve: Plutonium Measurement Example.
(a) Scatter Plot. (b) Lowess Smoothed Curve. Horizontal axis: plutonium activity (pCi/g).



As an initial step to examine the nature of the regression relationship, a lowess smoothed 
curve was obtained; this curve is shown in Figure 3.20b. We see that the regression
relationship may be linear or slightly curvilinear in the range of the plutonium activity levels
included in the study. We also see that one of the readings taken at 0.0 pCi/g (case 24) does not 
appear to fit with the rest of the observations. An examination of laboratory records revealed 
that the experimental conditions were not properly maintained for the last case, and it was 
therefore decided that case 24 should be discarded. Note, incidentally, how robust the lowess 
smoothing process was here by assigning very little weight to the outlying observation. 

A linear regression function was fitted next, based on the remaining 23 cases. The SAS-JMP
regression output is shown in Figure 3.21a, a plot of the residuals against the fitted
values is shown in Figure 3.21b, and a normal probability plot is shown in Figure 3.21c.
The JMP output uses the label Model to denote the regression component of the analysis
of variance; the label C Total stands for corrected total. We see from the regression output
that the slope of the regression line is not zero ($F^* = 228.9984$, P-value = .0000) so that a
regression relationship exists. We also see from the flared, megaphone shape of the residual
plot that the error variance appears to be increasing with the level of plutonium activity.
The normal probability plot suggests nonnormality (heavy tails), but the nonlinearity of the
plot is likely to be related (at least in part) to the unequal error variances. The existence of
nonconstant variance is confirmed by the Breusch-Pagan test statistic (3.11):

$X^2_{BP} = 23.29 > \chi^2(.95; 1) = 3.84$
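
For readers who want to see how a statistic of this kind is assembled, the following Python sketch computes the Breusch-Pagan statistic for a simple linear regression as $(SSR^*/2) \div (SSE/n)^2$, where $SSR^*$ is the regression sum of squares from regressing the squared residuals on X. The function names and the small data sets are hypothetical and are not the plutonium data; the only value taken from the text is the critical value $\chi^2(.95; 1) = 3.84$.

```python
def simple_ols(x, y):
    """Least squares intercept, slope, and fitted values for one predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    return b0, b1, [b0 + b1 * xi for xi in x]

def breusch_pagan(x, y):
    """Breusch-Pagan statistic: (SSR*/2) / (SSE/n)^2."""
    n = len(x)
    _, _, fitted = simple_ols(x, y)
    e2 = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]   # squared residuals
    sse = sum(e2)
    _, _, e2_fitted = simple_ols(x, e2)                  # regress e^2 on X
    e2_bar = sse / n
    ssr_star = sum((f - e2_bar) ** 2 for f in e2_fitted)
    return (ssr_star / 2.0) / (sse / n) ** 2

# Hypothetical data: two replicates at each X level, spread growing with X.
x_demo = [float(v) for v in range(1, 11) for _ in range(2)]
signs = [1.0, -1.0] * 10
y_growing = [2.0 * xi + s * 0.3 * xi for xi, s in zip(x_demo, signs)]
y_constant = [2.0 * xi + s * 0.3 for xi, s in zip(x_demo, signs)]

CHI2_95_1 = 3.84   # critical value quoted in the text
bp_growing = breusch_pagan(x_demo, y_growing)
bp_constant = breusch_pagan(x_demo, y_constant)
print(bp_growing > CHI2_95_1, bp_constant > CHI2_95_1)
```

In this sketch the data whose spread grows with X exceed the critical value, while the data with constant spread give a statistic of essentially zero.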

The presence of nonconstant variance clearly requires remediation. A number of
approaches could be followed, including the use of weighted least squares discussed in
Chapter 11. Often with count data, the error variance can be stabilized through the use of a
square root transformation of the response variable. Since this is just one in a range of
power transformations that might be useful, we shall use the Box-Cox procedure to suggest
an appropriate power transformation. Using the standardized variable (3.36), we find the
maximum likelihood estimate of $\lambda$ to be $\hat{\lambda} = .65$. Because the likelihood function is fairly
flat in the neighborhood of $\hat{\lambda} = .65$, the Box-Cox procedure supports the use of the square
root transformation (i.e., use of $\lambda = .5$). The results of fitting a linear regression function
when the response variable is $Y' = \sqrt{Y}$ are shown in Figure 3.22a.
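
The search over $\lambda$ can be sketched as follows. The sketch assumes the usual form of the standardized variable, $W_i = K_1(Y_i^{\lambda} - 1)$ for $\lambda \neq 0$ and $W_i = K_2 \ln Y_i$ for $\lambda = 0$, with $K_2$ the geometric mean of the $Y_i$ and $K_1 = 1/(\lambda K_2^{\lambda - 1})$, and it picks the $\lambda$ on a grid that minimizes the error sum of squares from regressing W on X. The function names, the grid, and the data are hypothetical; the data are constructed so that $\sqrt{Y}$ is exactly linear in X.

```python
import math

def sse_of_line(x, y):
    """Error sum of squares from a least squares straight-line fit."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def boxcox_sse(x, y, lam):
    """SSE for the standardized Box-Cox variable W at a given lambda (Y > 0)."""
    n = len(y)
    k2 = math.exp(sum(math.log(v) for v in y) / n)   # geometric mean of Y
    if abs(lam) < 1e-12:
        w = [k2 * math.log(v) for v in y]
    else:
        k1 = 1.0 / (lam * k2 ** (lam - 1.0))
        w = [k1 * (v ** lam - 1.0) for v in y]
    return sse_of_line(x, w)

# Hypothetical data for which the square root of Y is exactly linear in X.
x_demo = [float(v) for v in range(1, 13)]
y_demo = [(1.0 + 0.5 * xi) ** 2 for xi in x_demo]

grid = [0.3, 0.4, 0.5, 0.6, 0.7]
sse_by_lambda = {lam: boxcox_sse(x_demo, y_demo, lam) for lam in grid}
best_lambda = min(grid, key=lambda lam: sse_by_lambda[lam])
print(best_lambda)
```

Because the standardization puts the SSE values for different $\lambda$ on a comparable scale, they can be compared directly; a flat SSE profile near the minimum is what justifies rounding to a convenient value such as .5.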



[Figure 3.21, panels (b) Residual Plot (residuals against fitted values) and (c) Normal Probability Plot (residuals against expected values); plots not reproduced.]


At this point a new problem has arisen. Although the residual plot in Figure 3.22b shows
that the error variance appears to be more stable and the points in the normal probability
plot in Figure 3.22c fall roughly on a straight line, the residual plot now suggests that $Y'$
is nonlinearly related to X. This concern is confirmed by the lack of fit test statistic (3.25)
($F^* = 10.1364$, P-value = .0010). Of course, this result is not completely unexpected,
since Y was linearly related to X.

FIGURE 3.21 SAS-JMP Regression Output and Diagnostic Plots for Untransformed Data: Plutonium
Measurement Example.

(a) Regression Output

Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    0.0070331    0.0036         1.95     0.0641
Plutonium    0.005537     0.00037       15.13     0.0000

Source       DF    Sum of Squares    Mean Square    F Ratio
Model         1    0.03619042        0.036190       228.9984
Error        21    0.00331880        0.000158       Prob>F
C Total      22    0.03950922                       0.0000

Source         DF    Sum of Squares    Mean Square    F Ratio
Lack of Fit     2    0.00016811        0.000084       0.5069
Pure Error     19    0.00315069        0.000166       Prob>F
Total Error    21    0.00331880                       0.6103

[Diagnostic plot panels (b) Residual Plot and (c) Normal Probability Plot (residuals against expected values); plots not reproduced.]

To restore a linear relation with the transformed Y variable, we shall see if a square root
transformation of X will lead to a satisfactory linear fit. The regression results when
regressing $Y' = \sqrt{Y}$ on $X' = \sqrt{X}$ are presented in Figure 3.23. Notice from the residual plot
in Figure 3.23b that the square root transformation of the predictor variable has eliminated
the lack of fit. Also, the normal probability plot of the residuals in Figure 3.23c appears
to be satisfactory, and the correlation test (r = .986) supports the assumption of normally
distributed error terms (the interpolated critical value in Table B.6 for $\alpha = .05$ and n = 23
is .9555). However, the residual plot suggests that some nonconstancy of the error variance
may still remain; but if so, it does not appear to be substantial. The Breusch-Pagan test
statistic (3.11) is $X^2_{BP} = 3.85$, which corresponds to a P-value of .05, supporting the conclusion
from the residual plot that the nonconstancy of the error variance is not substantial.
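
The correlation test for normality mentioned in this discussion correlates the ordered residuals with their expected values under normality. A minimal Python sketch follows. It assumes the common approximation in which the expected value of the k-th smallest of n residuals is proportional to $z[(k - .375)/(n + .25)]$; the scale factor $\sqrt{MSE}$ is omitted because it does not affect a correlation. The function names and the residuals are hypothetical, and only the critical value .9555 for n = 23 is taken from the text.

```python
import math
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected standard normal order statistics."""
    nd = NormalDist()
    return [nd.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

def pearson_r(a, b):
    """Ordinary coefficient of correlation between two equal-length lists."""
    n = len(a)
    abar = sum(a) / n
    bbar = sum(b) / n
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def normal_plot_correlation(residuals):
    """Correlation between ordered residuals and their normal scores."""
    ordered = sorted(residuals)
    return pearson_r(ordered, normal_scores(len(ordered)))

CRITICAL_N23_ALPHA05 = 0.9555   # interpolated Table B.6 value quoted in the text

# Hypothetical residuals: one set exactly normal-shaped, one right-skewed.
normal_like = normal_scores(23)
skewed = [math.exp(2.0 * z) for z in normal_like]

r_normal = normal_plot_correlation(normal_like)
r_skewed = normal_plot_correlation(skewed)
print(r_normal >= CRITICAL_N23_ALPHA05, r_skewed >= CRITICAL_N23_ALPHA05)
```

A correlation at or above the critical value supports normality of the error terms; a smaller value points to a departure from normality.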

Figure 3.23d contains a SYSTAT plot of the confidence band (2.40) for the fitted
regression line:

$\hat{Y}' = .0730 + .0573X'$

We see that the regression line has been estimated fairly precisely. Also plotted in this figure
is the lowess smoothed curve. This smoothed curve falls entirely within the confidence band,
supporting the reasonableness of a linear regression relation between $Y'$ and $X'$. The lack of
fit test statistic (3.25) now is $F^* = 1.2868$ (P-value = .2992), also supporting the linearity
of the regression relating $Y' = \sqrt{Y}$ to $X' = \sqrt{X}$.
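
The three lack of fit statistics quoted in this case example can be checked directly from the sums of squares and degrees of freedom printed in the regression output, since $F^* = (SSLF/df_{LF}) \div (SSPE/df_{PE})$. The short Python sketch below does this arithmetic; the function name and dictionary labels are illustrative only, and the numerical inputs are those reported in Figures 3.21a, 3.22a, and 3.23a.

```python
def f_ratio(ss_num, df_num, ss_den, df_den):
    """Ratio of two mean squares, each a sum of squares over its df."""
    return (ss_num / df_num) / (ss_den / df_den)

# (SSLF, df, SSPE, df) as printed in the three regression outputs.
lack_of_fit_inputs = {
    "Y on X (Figure 3.21a)": (0.00016811, 2, 0.00315069, 19),
    "sqrt(Y) on X (Figure 3.22a)": (0.01210640, 2, 0.01134631, 19),
    "sqrt(Y) on sqrt(X) (Figure 3.23a)": (0.00153683, 2, 0.01134631, 19),
}

lack_of_fit_f = {label: f_ratio(*args) for label, args in lack_of_fit_inputs.items()}
for label, value in lack_of_fit_f.items():
    # Reported values in the outputs: 0.5069, 10.1364, and 1.2868.
    print(f"{label}: F* = {value:.4f}")
```

The pure error sum of squares is the same in the last two fits because only the predictor was transformed, so the replicate groups and the transformed responses are unchanged.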


FIGURE 3.22 SAS-JMP Regression Output and Diagnostic Plots for Transformed Response
Variable: Plutonium Measurement Example.

(a) Regression Output

Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    0.0947596    0.00957        9.91     0.0000
Plutonium    0.0133648    0.00097       13.74     0.0000

Source       DF    Sum of Squares    Mean Square    F Ratio
Model         1    0.21084655        0.210847       188.7960
Error        21    0.02345271        0.001117       Prob>F
C Total      22    0.23429926                       0.0000

Source         DF    Sum of Squares    Mean Square    F Ratio
Lack of Fit     2    0.01210640        0.006053       10.1364
Pure Error     19    0.01134631        0.000597       Prob>F
Total Error    21    0.02345271                       0.0010




[Plots not reproduced: residual plot (residuals against fitted values), normal probability plot (residuals against expected values), and Figure 3.23 panel (d) Confidence Band for Regression Line and Lowess Curve.]


FIGURE 3.23 SAS-JMP Regression Output and Diagnostic Plots for Transformed Response and Predictor
Variables: Plutonium Measurement Example.

(a) Regression Output

Term              Estimate     Std Error    t Ratio    Prob>|t|
Intercept         0.0730056    0.00783        9.32     0.0000
Sqrt Plutonium    0.0573055    0.00302       19.00     0.0000

Source       DF    Sum of Squares    Mean Square    F Ratio
Model         1    0.22141612        0.221416       360.9166
Error        21    0.01288314        0.000613       Prob>F
C Total      22    0.23429926                       0.0000

Source         DF    Sum of Squares    Mean Square    F Ratio
Lack of Fit     2    0.00153683        0.000768       1.2868
Pure Error     19    0.01134631        0.000597       Prob>F
Total Error    21    0.01288314                       0.2992


[Figure 3.23, panels (b) Residual Plot and (c) Normal Probability Plot; plots not reproduced.]


Cited References

3.1. Barnett, V., and T. Lewis. Outliers in Statistical Data. 3rd ed. New York: John Wiley & Sons,
1994.

3.2. Looney, S. W., and T. R. Gulledge, Jr. “Use of the Correlation Coefficient with Normal
Probability Plots,” The American Statistician 39 (1985), pp. 75-79.

3.3. Shapiro, S. S., and M. B. Wilk. “An Analysis of Variance Test for Normality (Complete
Samples),” Biometrika 52 (1965), pp. 591-611.

3.4. Levene, H. “Robust Tests for Equality of Variances,” in Contributions to Probability and
Statistics, ed. I. Olkin. Palo Alto, Calif.: Stanford University Press, 1960, pp. 278-92.

3.5. Brown, M. B., and A. B. Forsythe. “Robust Tests for Equality of Variances,” Journal of the
American Statistical Association 69 (1974), pp. 364-67.

3.6. Breusch, T. S., and A. R. Pagan. “A Simple Test for Heteroscedasticity and Random Coefficient
Variation,” Econometrica 47 (1979), pp. 1287-94.

3.7. Cook, R. D., and S. Weisberg. “Diagnostics for Heteroscedasticity in Regression,” Biometrika
70 (1983), pp. 1-10.

3.8. Joglekar, G., J. H. Schuenemeyer, and V. LaRiccia. “Lack-of-Fit Testing When Replicates Are
Not Available,” The American Statistician 43 (1989), pp. 135-43.

3.9. Box, G. E. P., and D. R. Cox. “An Analysis of Transformations,” Journal of the Royal Statistical
Society B 26 (1964), pp. 211-43.

3.10. Draper, N. R., and H. Smith. Applied Regression Analysis. 3rd ed. New York: John Wiley &
Sons, 1998.

3.11. Velleman, P. F., and D. C. Hoaglin. Applications, Basics, and Computing of Exploratory Data
Analysis. Boston: Duxbury Press, 1981.

3.12. Cleveland, W. S. “Robust Locally Weighted Regression and Smoothing Scatterplots,” Journal
of the American Statistical Association 74 (1979), pp. 829-36.

3.13. Altman, N. S. “An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression,”
The American Statistician 46 (1992), pp. 175-85.

3.14. Haerdle, W. Applied Nonparametric Regression. Cambridge: Cambridge University Press,
1990.


Problems

3.1. Distinguish between (1) residual and semistudentized residual, (2) $E\{\varepsilon_i\} = 0$ and $\bar{e} = 0$,
(3) error term and residual.

3.2. Prepare a prototype residual plot for each of the following cases: (1) error variance decreases 

with X; (2) true regression function is U shaped, but a linear regression function is fitted. 

3.3. Refer to Grade point average Problem 1.19. 

a. Prepare a box plot for the ACT scores $X_i$. Are there any noteworthy features in this plot?

b. Prepare a dot plot of the residuals. What information does this plot provide?

c. Plot the residual $e_i$ against the fitted values $\hat{Y}_i$. What departures from regression model (2.1)
can be studied from this plot? What are your findings?

d. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Test the
reasonableness of the normality assumption here using Table B.6 and $\alpha = .05$. What do you
conclude?

e. Conduct the Brown-Forsythe test to determine whether or not the error variance varies with
the level of X. Divide the data into the two groups, $X < 26$, $X \geq 26$, and use $\alpha = .01$. State
the decision rule and conclusion. Does your conclusion support your preliminary findings
in part (c)?





f. Information is given below for each student on two variables not included in the model,
namely, intelligence test score ($X_2$) and high school class rank percentile ($X_3$). (Note that
larger class rank percentiles indicate higher standing in the class, e.g., 1% is near the bottom
of the class and 99% is near the top of the class.) Plot the residuals against $X_2$ and $X_3$ on
separate graphs to ascertain whether the model can be improved by including either of these
variables. What do you conclude?

i:      1    2    3   ...  118  119  120
X_2:  122  132  119   ...  140  111  110
X_3:   99   71   75   ...   97   65   85

*3.4. Refer to Copier maintenance Problem 1.20. 

a. Prepare a dot plot for the number of copiers serviced $X_i$. What information is provided by
this plot? Are there any outlying cases with respect to this variable?

b. The cases are given in time order. Prepare a time plot for the number of copiers serviced.
What does your plot show?

c. Prepare a stem-and-leaf plot of the residuals. Are there any noteworthy features in this plot?

d. Prepare residual plots of $e_i$ versus $\hat{Y}_i$ and $e_i$ versus $X_i$ on separate graphs. Do these plots
provide the same information? What departures from regression model (2.1) can be studied
from these plots? State your findings.

e. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Does the normality
assumption appear to be tenable here? Use Table B.6 and $\alpha = .10$.

f. Prepare a time plot of the residuals to ascertain whether the error terms are correlated over
time. What is your conclusion?

g. Assume that (3.10) is applicable and conduct the Breusch-Pagan test to determine whether
or not the error variance varies with the level of X. Use $\alpha = .05$. State the alternatives,
decision rule, and conclusion.

h. Information is given below on two variables not included in the regression model, namely,
mean operational age of copiers serviced on the call ($X_2$, in months) and years of experience
of the service person making the call ($X_3$). Plot the residuals against $X_2$ and $X_3$ on separate
graphs to ascertain whether the model can be improved by including either or both of these
variables. What do you conclude?

i:     1   2   3  ...  43  44  45
X_2:  20  19  27  ...  28  26  33
X_3:   4   5   4  ...   3   3   6

*3.5. Refer to Airfreight breakage Problem 1.21. 

a. Prepare a dot plot for the number of transfers $X_i$. Does the distribution of number of transfers
appear to be asymmetrical?

b. The cases are given in time order. Prepare a time plot for the number of transfers. Is any
systematic pattern evident in your plot? Discuss.

c. Obtain the residuals $e_i$ and prepare a stem-and-leaf plot of the residuals. What information
is provided by your plot?

d. Plot the residuals $e_i$ against $X_i$ to ascertain whether any departures from regression
model (2.1) are evident. What is your conclusion?

e. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality to ascertain whether
the normality assumption is reasonable here. Use Table B.6 and $\alpha = .01$. What do you
conclude?

f. Prepare a time plot of the residuals. What information is provided by your plot?

g. Assume that (3.10) is applicable and conduct the Breusch-Pagan test to determine whether
or not the error variance varies with the level of X. Use $\alpha = .10$. State the alternatives,
decision rule, and conclusion. Does your conclusion support your preliminary findings in
part (d)?


3.6. Refer to Plastic hardness Problem 1.22. 



a. Obtain the residuals $e_i$ and prepare a box plot of the residuals. What information is provided
by your plot?

b. Plot the residuals $e_i$ against the fitted values $\hat{Y}_i$ to ascertain whether any departures from
regression model (2.1) are evident. State your findings.

c. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Does the normality
assumption appear to be reasonable here? Use Table B.6 and $\alpha = .05$.

d. Compare the frequencies of the residuals against the expected frequencies under normality,
using the 25th, 50th, and 75th percentiles of the relevant t distribution. Is the information
provided by these comparisons consistent with the findings from the normal probability plot
in part (c)?

e. Use the Brown-Forsythe test to determine whether or not the error variance varies with the
level of X. Divide the data into the two groups, $X \leq 24$, $X > 24$, and use $\alpha = .05$. State
the decision rule and conclusion. Does your conclusion support your preliminary findings
in part (b)?


*3.7. Refer to Muscle mass Problem 1.27.

a. Prepare a stem-and-leaf plot for the ages $X_i$. Is this plot consistent with the random selection
of women from each 10-year age group? Explain.

b. Obtain the residuals $e_i$ and prepare a dot plot of the residuals. What does your plot show?

c. Plot the residuals $e_i$ against $\hat{Y}_i$ and also against $X_i$ on separate graphs to ascertain whether
any departures from regression model (2.1) are evident. Do the two plots provide the same
information? State your conclusions.

d. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality to ascertain whether
the normality assumption is tenable here. Use Table B.6 and $\alpha = .10$. What do you conclude?

e. Assume that (3.10) is applicable and conduct the Breusch-Pagan test to determine whether
or not the error variance varies with the level of X. Use $\alpha = .01$. State the alternatives,
decision rule, and conclusion. Is your conclusion consistent with your preliminary findings
in part (c)?

3.8. Refer to Crime rate Problem 1.28. 


a. Prepare a stem-and-leaf plot for the percentage of individuals in the county having at least
a high school diploma $X_i$. What information does your plot provide?

b. Obtain the residuals $e_i$ and prepare a box plot of the residuals. Does the distribution of the
residuals appear to be symmetrical?

c. Make a residual plot of $e_i$ versus $\hat{Y}_i$. What does the plot show?

d. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Test the
reasonableness of the normality assumption using Table B.6 and $\alpha = .05$. What do you conclude?

e. Conduct the Brown-Forsythe test to determine whether or not the error variance varies with
the level of X. Divide the data into the two groups, $X \leq 69$, $X > 69$, and use $\alpha = .05$. State
the decision rule and conclusion. Does your conclusion support your preliminary findings
in part (c)?

3.9. Electricity consumption. An economist studying the relation between household electricity
consumption (Y) and number of rooms in the home (X) employed linear regression model (2.1)
and obtained the following residuals:

i:      1     2     3     4     5     6     7     8     9    10
X_i:    2     3     4     5     6     7     8     9    10    11
e_i:  3.2   2.9  -1.7  -2.0  -2.3  -1.2   -.9    .8    .7    .5

Plot the residuals $e_i$ against $X_i$. What problem appears to be present here? Might a
transformation alleviate this problem?

3.10. Per capita earnings. A sociologist employed linear regression model (2.1) to relate per capita
earnings (Y) to average number of years of schooling (X) for 12 cities. The fitted values $\hat{Y}_i$ and
the semistudentized residuals $e_i^*$ follow.

i:           1     2      3   ...     10     11     12
\hat{Y}_i: 9.9   9.3   10.2   ...   15.6   11.2   13.1
e_i^*:   -1.12   .81   -.76   ...  -3.78    .74    .32

a. Plot the semistudentized residuals against the fitted values. What does the plot suggest?

b. How many semistudentized residuals are outside $\pm 1$ standard deviation? Approximately
how many would you expect to see if the normal error model is appropriate?

3.11. Drug concentration. A pharmacologist employed linear regression model (2.1) to study the
relation between the concentration of a drug in plasma (Y) and the log-dose of the drug (X).
The residuals and log-dose levels follow.

i:      1     2     3     4     5     6     7     8     9
X_i:   -1     0     1    -1     0     1    -1     0     1
e_i:   .5   2.1  -3.4    .3  -1.7   4.2   -.6   2.6  -4.0

a. Plot the residuals $e_i$ against $X_i$. What conclusions do you draw from the plot?

b. Assume that (3.10) is applicable and conduct the Breusch-Pagan test to determine whether
or not the error variance varies with log-dose of the drug (X). Use $\alpha = .05$. State the
alternatives, decision rule, and conclusion. Does your conclusion support your preliminary
findings in part (a)?

3.12. A student does not understand why the sum of squares defined in (3.16) is called a pure error
sum of squares “since the formula looks like one for an ordinary sum of squares.” Explain.


*3.13. Refer to Copier maintenance Problem 1.20. 

a. What are the alternative conclusions when testing for lack of fit of a linear regression 
function? 

b. Perform the test indicated in part (a). Control the risk of Type I error at .05. State the decision 
rule and conclusion. 

c. Does the test in part (b) detect other departures from regression model (2.1), such as lack
of constant variance or lack of normality in the error terms? Could the results of the test of
lack of fit be affected by such departures? Discuss.

3.14. Refer to Plastic hardness Problem 1.22.

a. Perform the F test to determine whether or not there is lack of fit of a linear regression
function; use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

b. Is there any advantage of having an equal number of replications at each of the X levels? Is
there any disadvantage?

c. Does the test in part (a) indicate what regression function is appropriate when it leads to the 
conclusion that the regression function is not linear? How would you proceed? 


3.15. Solution concentration. A chemist studied the concentration of a solution (Y) over time (X).
Fifteen identical solutions were prepared. The 15 solutions were randomly divided into five
sets of three, and the five sets were measured, respectively, after 1, 3, 5, 7, and 9 hours. The
results follow.

i:      1    2    3   ...    13    14    15
X_i:    9    9    9   ...     1     1     1
Y_i:  .07  .09  .08   ...  2.84  2.57  3.10

a. Fit a linear regression function.

b. Perform the F test to determine whether or not there is lack of fit of a linear regression
function; use $\alpha = .025$. State the alternatives, decision rule, and conclusion.

c. Does the test in part (b) indicate what regression function is appropriate when it leads to the
conclusion that lack of fit of a linear regression function exists? Explain.

3.16. Refer to Solution concentration Problem 3.15. 

a. Prepare a scatter plot of the data. What transformation of Y might you try, using the prototype
patterns in Figure 3.15 to achieve constant variance and linearity?

b. Use the Box-Cox procedure and standardization (3.36) to find an appropriate power
transformation. Evaluate SSE for $\lambda = -.2, -.1, 0, .1, .2$. What transformation of Y is
suggested?

c. Use the transformation $Y' = \log_{10} Y$ and obtain the estimated linear regression function for
the transformed data.

d. Plot the estimated regression line and the transformed data. Does the regression line appear 
to be a good fit to the transformed data? 

e. Obtain the residuals and plot them against the fitted values. Also prepare a normal probability 
plot. What do your plots show? 

f. Express the estimated regression function in the original units. 

*3.17. Sales growth. A marketing researcher studied annual sales of a product that had been introduced
10 years ago. The data are as follows, where X is the year (coded) and Y is sales in thousands
of units:

i:     1    2    3    4    5    6    7    8    9   10
X_i:   0    1    2    3    4    5    6    7    8    9
Y_i:  98  135  162  178  221  232  283  300  374  395

a. Prepare a scatter plot of the data. Does a linear relation appear adequate here?

b. Use the Box-Cox procedure and standardization (3.36) to find an appropriate power
transformation of Y. Evaluate SSE for $\lambda = .3, .4, .5, .6, .7$. What transformation of Y is suggested?

c. Use the transformation $Y' = \sqrt{Y}$ and obtain the estimated linear regression function for the
transformed data.

d. Plot the estimated regression line and the transformed data. Does the regression line appear 
to be a good fit to the transformed data? 

e. Obtain the residuals and plot them against the fitted values. Also prepare a normal probability 
plot. What do your plots show? 

f. Express the estimated regression function in the original units. 

3.18. Production time. In a manufacturing study, the production times for 111 recent production
runs were obtained. The table below lists for each run the production time in hours (Y) and the
production lot size (X).

i:        1     2      3   ...    109    110    111
X_i:     15     9      7   ...     12      9     15
Y_i:  14.28  8.80  12.49   ...  16.37  11.45  15.78

a. Prepare a scatter plot of the data. Does a linear relation appear adequate here? Would a
transformation on X or Y be more appropriate here? Why?

b. Use the transformation $X' = \sqrt{X}$ and obtain the estimated linear regression function for the
transformed data.

c. Plot the estimated regression line and the transformed data. Does the regression line appear
to be a good fit to the transformed data?

d. Obtain the residuals and plot them against the fitted values. Also prepare a normal probability
plot. What do your plots show?

e. Express the estimated regression function in the original units. 


Exercises 


3.19. A student fitted a linear regression function for a class assignment. The student plotted the 
residuals e ； against Yi and found a positive relation. When the residuals were plotted against 
the fitted values Y；, the student found no relation. How could this difference arise? Which is 
the more meaningful plot? 

3.20. If the error terms in a regression model are independent N(0, <r 2 ), what can be said about the 
error terms after transformation X' = i/X is used? Is the situation the same after transformation 

= 1/y is used? * 

3.21. Derive the result In (3.29). * 

3.22. Using (A.70), (A.41), and (A.42), show that E{MSPE] = cr 2 for normal error regression 
model (2.1). 






3.23. A linear regression model with intercept $\beta_0 = 0$ is under consideration. Data have been
obtained that contain replications. State the full and reduced models for testing the
appropriateness of the regression function under consideration. What are the degrees of freedom
associated with the full and reduced models if $n = 20$ and $c = 10$?


Projects 


3.24. Blood pressure. The following data were obtained in a study of the relation between diastolic
blood pressure ($Y$) and age ($X$) for boys 5 to 13 years old.

 $i$:      1   2   3   4   5   6   7   8
 $X_i$:    5   8  11   7  13  12  12   6
 $Y_i$:   63  67  74  64  75  69  90  60


a. Assuming normal error regression model (2.1) is appropriate, obtain the estimated regression 
function and plot the residuals $e_i$ against $X_i$. What does your residual plot show?

b. Omit case 7 from the data and obtain the estimated regression function based on the remaining 
seven cases. Compare this estimated regression function to that obtained in part (a). What 
can you conclude about the effect of case 7? 

c. Using your fitted regression function in part (b), obtain a 99 percent prediction interval for 
a new $Y$ observation at $X = 12$. Does observation $Y_7$ fall outside this prediction interval?
What is the significance of this? 

3.25. Refer to the CDI data set in Appendix C.2 and Project 1.43. For each of the three fitted regression 
models, obtain the residuals and prepare a residual plot against X and a normal probability plot. 
Summarize your conclusions. Is linear regression model (2.1) more appropriate in one case than 
in the others? 

3.26. Refer to the CDI data set in Appendix C.2 and Project 1.44. For each geographic region, obtain
the residuals and prepare a residual plot against X and a normal probability plot. Do the four 
regions appear to have similar error variances? What other conclusions do you draw from your 
plots? 

3.27. Refer to the SENIC data set in Appendix C.1 and Project 1.45.

a. For each of the three fitted regression models, obtain the residuals and prepare a residual plot 
against X and a normal probability plot. Summarize your conclusions. Is linear regression 
model (2.1) more apt in one case than in the others?

b. Obtain the fitted regression function for the relation between length of stay and infection
risk after deleting cases 47 ($X_{47} = 6.5$, $Y_{47} = 19.56$) and 112 ($X_{112} = 5.9$, $Y_{112} = 17.94$).
From this fitted regression function obtain separate 95 percent prediction intervals for new
$Y$ observations at $X = 6.5$ and $X = 5.9$, respectively. Do observations $Y_{47}$ and $Y_{112}$ fall
outside these prediction intervals? Discuss the significance of this.

3.28. Refer to the SENIC data set in Appendix C.1 and Project 1.46. For each geographic region,
obtain the residuals and prepare a residual plot against $X$ and a normal probability plot. Do the
four regions appear to have similar error variances? What other conclusions do you draw from
your plots?

3.29. Refer to Copier maintenance Problem 1.20. 

a. Divide the data into four bands according to the number of copiers serviced ($X$). Band 1
ranges from $X = .5$ to $X = 2.5$; band 2 ranges from $X = 2.5$ to $X = 4.5$; and so forth.
Determine the median value of $X$ and the median value of $Y$ in each of the bands and develop
the band smooth by connecting the four pairs of medians by straight lines on a scatter plot
of the data. Does the band smooth suggest that the regression relation is linear? Discuss.

b. Obtain the 90 percent confidence band for the true regression line and plot it on the scatter 
plot prepared in part (a). Does the band smooth fall entirely inside the confidence band? 
What does this tell you about the appropriateness of the linear regression function? 

c. Create a series of six overlapping neighborhoods of width 3.0 beginning at $X = .5$. The
first neighborhood will range from $X = .5$ to $X = 3.5$; the second neighborhood will range
from $X = 1.5$ to $X = 4.5$; and so on. For each of the six overlapping neighborhoods, fit a
linear regression function and obtain the fitted value $\hat{Y}_c$ at the center $X_c$ of the neighborhood.
Develop a simplified version of the lowess smooth by connecting the six $(X_c, \hat{Y}_c)$ pairs by
straight lines on a scatter plot of the data. In what ways does your simplified lowess smooth
differ from the band smooth obtained in part (a)?

3.30. Refer to Sales growth Problem 3.17. 

a. Divide the range of the predictor variable (coded years) into five bands of width 2.0, as 
follows: Band 1 ranges from $X = -.5$ to $X = 1.5$; band 2 ranges from $X = 1.5$ to $X = 3.5$;
and so on. Determine the median value of $X$ and the median value of $Y$ in each band and
develop the band smooth by connecting the five pairs of medians by straight lines on a
scatter plot of the data. Does the band smooth suggest that the regression relation is linear?
Discuss.

b. Create a series of seven overlapping neighborhoods of width 3.0 beginning at $X = -.5$. The
first neighborhood will range from $X = -.5$ to $X = 2.5$; the second neighborhood will range
from $X = .5$ to $X = 3.5$; and so on. For each of the seven overlapping neighborhoods, fit a
linear regression function and obtain the fitted value $\hat{Y}_c$ at the center $X_c$ of the neighborhood.
Develop a simplified version of the lowess smooth by connecting the seven $(X_c, \hat{Y}_c)$ pairs
by straight lines on a scatter plot of the data.

c. Obtain the 95 percent confidence band for the true regression line and plot it on the plot 
prepared in part (b). Does the simplified lowess smooth fall entirely within the confidence 
band for the regression line? What does this tell you about the appropriateness of the linear 
regression function? 


Case Studies


3.31. Refer to the Real estate sales data set in Appendix C.7. Obtain a random sample of 200 cases 
from the 522 cases in this data set. Using the random sample, build a regression model to 
predict sales price ($Y$) as a function of finished square feet ($X$). The analysis should include an
assessment of the degree to which the key regression assumptions are satisfied. If the regression 
assumptions are not met, include and justify appropriate remedial measures. Use the final model 
to predict sales price for two houses that are about to come on the market: the first has X = 1100 
finished square feet and the second has X = 4900 finished square feet. Assess the strengths 
and weaknesses of the final model. 

3.32. Refer to the Prostate cancer data set in Appendix C.5. Build a regression model to predict PSA
level ($Y$) as a function of cancer volume ($X$). The analysis should include an assessment of
the degree to which the key regression assumptions are satisfied. If the regression assumptions
are not met, include and justify appropriate remedial measures. Use the final model to estimate
mean PSA level for a patient whose cancer volume is 20 cc. Assess the strengths and weaknesses
of the final model.




Chapter 4

Simultaneous Inferences
and Other Topics in
Regression Analysis


In this chapter, we take up a variety of topics in simple linear regression analysis. Several 
of the topics pertain to how to make simultaneous inferences from the same set of sample 
observations. 

4.1 Joint Estimation of $\beta_0$ and $\beta_1$

Need for Joint Estimation

A market research analyst conducted a study of the relation between level of advertising
expenditures ($X$) and sales ($Y$). The study included six different levels of advertising
expenditures, one of which was no advertising ($X = 0$). The scatter plot suggested a linear
relationship in the range of the advertising expenditures levels studied. The analyst now
wishes to draw inferences with confidence coefficient .95 about both the intercept $\beta_0$ and the
slope $\beta_1$. The analyst could use the methods of Chapter 2 to construct separate 95 percent
confidence intervals for $\beta_0$ and $\beta_1$. The difficulty is that these would not provide 95 percent
confidence that the conclusions for both $\beta_0$ and $\beta_1$ are correct. If the inferences were
independent, the probability of both being correct would be $(.95)^2$, or only .9025. The inferences are
not, however, independent, coming as they do from the same set of sample data, which makes
the determination of the probability of both inferences being correct much more difficult.
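The gap between statement and family confidence can be made concrete with a small simulation. The sketch below is illustrative only: the true parameters, error standard deviation, and design points are hypothetical, and the multiple 2.069 is the value of $t(.975; 23)$ used for $n = 25$ in the Toluca Company example. It counts how often each separate 95 percent interval covers its parameter and how often both do so for the same sample.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares fit; returns b0, b1, s{b0}, s{b1}."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    s_b1 = math.sqrt(mse / sxx)
    s_b0 = math.sqrt(mse * (1.0 / n + x_bar ** 2 / sxx))
    return b0, b1, s_b0, s_b1


def simulate_coverage(reps=2000, seed=1):
    """Estimate statement and family coverage of separate 95% intervals."""
    rng = random.Random(seed)
    beta0, beta1, sigma = 60.0, 3.5, 50.0   # hypothetical true parameters
    x = [20 + 4 * i for i in range(25)]     # hypothetical design, n = 25
    t_mult = 2.069                          # t(.975; 23)
    hit0 = hit1 = hit_both = 0
    for _ in range(reps):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        b0, b1, s_b0, s_b1 = fit_line(x, y)
        ok0 = abs(b0 - beta0) <= t_mult * s_b0
        ok1 = abs(b1 - beta1) <= t_mult * s_b1
        hit0 += ok0
        hit1 += ok1
        hit_both += ok0 and ok1
    return hit0 / reps, hit1 / reps, hit_both / reps


cover0, cover1, family = simulate_coverage()
print(f"beta0: {cover0:.3f}  beta1: {cover1:.3f}  both: {family:.3f}")
```

Under these assumptions each interval should cover its own parameter in roughly 95 percent of samples, while the proportion of samples in which both intervals are correct can be no larger than either of these.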

Analysis of data frequently requires a series of estimates (or tests) where the analyst
would like to have an assurance about the correctness of the entire set of estimates (or tests).
We shall call the set of estimates (or tests) of interest the family of estimates (or tests). In our
illustration, the family consists of two estimates, for $\beta_0$ and $\beta_1$. We then distinguish between a
statement confidence coefficient and a family confidence coefficient. A statement confidence
coefficient is the familiar type of confidence coefficient discussed earlier, which indicates the
proportion of correct estimates that are obtained when repeated samples are selected and the
specified confidence interval is calculated for each sample. A family confidence coefficient,
on the other hand, indicates the proportion of families of estimates that are entirely correct
when repeated samples are selected and the specified confidence intervals for the entire
family are calculated for each sample. Thus, a family confidence coefficient corresponds to
the probability, in advance of sampling, that the entire family of statements will be correct.

To illustrate the meaning of a family confidence coefficient further, consider again the 
joint estimation of $\beta_0$ and $\beta_1$. A family confidence coefficient of, say, .95 would indicate here
that if repeated samples are selected and interval estimates for both $\beta_0$ and $\beta_1$ are calculated
for each sample by specified procedures, 95 percent of the samples would lead to a family 
of estimates where both confidence intervals are correct. For 5 percent of the samples, either 
one or both of the interval estimates would be incorrect. 

A procedure that provides a family confidence coefficient when estimating both $\beta_0$ and $\beta_1$
is often highly desirable since it permits the analyst to weave the two separate results together
into an integrated set of conclusions, with an assurance that the entire set of estimates is
correct. We now discuss one procedure for constructing simultaneous confidence intervals
for $\beta_0$ and $\beta_1$ with a specified family confidence coefficient: the Bonferroni procedure.

Bonferroni Joint Confidence Intervals 

The Bonferroni procedure for developing joint confidence intervals for $\beta_0$ and $\beta_1$ with a
specified family confidence coefficient is very simple: each statement confidence coefficient
is adjusted to be higher than $1 - \alpha$ so that the family confidence coefficient is at least $1 - \alpha$.
The procedure is a general one that can be applied in many cases, as we shall see, not just
for the joint estimation of $\beta_0$ and $\beta_1$.

We start with ordinary confidence limits for $\beta_0$ and $\beta_1$ with statement confidence
coefficients $1 - \alpha$ each. These limits are:

$$b_0 \pm t(1 - \alpha/2;\, n - 2)\, s\{b_0\}$$
$$b_1 \pm t(1 - \alpha/2;\, n - 2)\, s\{b_1\}$$

We first ask what is the probability that one or both of these intervals are incorrect. Let $A_1$
denote the event that the first confidence interval does not cover $\beta_0$, and let $A_2$ denote the
event that the second confidence interval does not cover $\beta_1$. We know:

$$P(A_1) = \alpha \qquad P(A_2) = \alpha$$

Probability theorem (A.6) gives the desired probability: 

$$P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$$

Next, we use complementation property (A.9) to obtain the probability that both intervals
are correct, denoted by $P(\bar{A}_1 \cap \bar{A}_2)$:

$$P(\bar{A}_1 \cap \bar{A}_2) = 1 - P(A_1 \cup A_2) = 1 - P(A_1) - P(A_2) + P(A_1 \cap A_2) \tag{4.1}$$

Note from probability properties (A.9) and (A.10) that $\bar{A}_1 \cap \bar{A}_2$ and $A_1 \cup A_2$ are
complementary events:

$$1 - P(A_1 \cup A_2) = P(\overline{A_1 \cup A_2}) = P(\bar{A}_1 \cap \bar{A}_2)$$

Finally, we use the fact that $P(A_1 \cap A_2) \ge 0$ to obtain from (4.1) the Bonferroni
inequality:

$$P(\bar{A}_1 \cap \bar{A}_2) \ge 1 - P(A_1) - P(A_2) \tag{4.2}$$





which for our situation is:

$$P(\bar{A}_1 \cap \bar{A}_2) \ge 1 - \alpha - \alpha = 1 - 2\alpha \tag{4.2a}$$

Thus, if $\beta_0$ and $\beta_1$ are separately estimated with, say, 95 percent confidence intervals, the
Bonferroni inequality guarantees us a family confidence coefficient of at least 90 percent
that both intervals based on the same sample are correct.

We can easily use the Bonferroni inequality (4.2a) to obtain a family confidence
coefficient of at least $1 - \alpha$ for estimating $\beta_0$ and $\beta_1$. We do this by estimating $\beta_0$ and $\beta_1$ separately
with statement confidence coefficients of $1 - \alpha/2$ each. This yields the Bonferroni bound
$1 - \alpha/2 - \alpha/2 = 1 - \alpha$. Thus, the $1 - \alpha$ family confidence limits for $\beta_0$ and $\beta_1$ for regression
model (2.1) by the Bonferroni procedure are:

$$b_0 \pm B s\{b_0\} \qquad b_1 \pm B s\{b_1\} \tag{4.3}$$

where:

$$B = t(1 - \alpha/4;\, n - 2) \tag{4.3a}$$

and $b_0$, $b_1$, $s\{b_0\}$, and $s\{b_1\}$ are defined in (1.10), (2.9), and (2.23). Note that a statement
confidence coefficient of $1 - \alpha/2$ requires the $(1 - \alpha/4)100$ percentile of the t distribution
for a two-sided confidence interval.

Example

For the Toluca Company example, 90 percent family confidence intervals for $\beta_0$ and $\beta_1$
require $B = t(1 - .10/4;\, 23) = t(.975;\, 23) = 2.069$. We have from Chapter 2:

$$b_0 = 62.37 \qquad s\{b_0\} = 26.18$$
$$b_1 = 3.5702 \qquad s\{b_1\} = .3470$$

Hence, the respective confidence limits for $\beta_0$ and $\beta_1$ are $62.37 \pm 2.069(26.18)$ and
$3.5702 \pm 2.069(.3470)$, and the joint confidence intervals are:

$$8.20 \le \beta_0 \le 116.5$$
$$2.85 \le \beta_1 \le 4.29$$

Thus, we conclude that $\beta_0$ is between 8.20 and 116.5 and $\beta_1$ is between 2.85 and 4.29.
The family confidence coefficient is at least .90 that the procedure leads to correct pairs of
interval estimates.
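The arithmetic of the Toluca Company intervals can be checked with a few lines of code. This is a minimal sketch: the helper name `bonferroni_joint_limits` is made up for illustration, and the multiple $B = 2.069$ is entered as quoted in the text rather than computed from the t distribution.

```python
def bonferroni_joint_limits(estimates, std_errors, multiple):
    """Return (lower, upper) limits: estimate +/- multiple * standard error."""
    return [(b - multiple * s, b + multiple * s)
            for b, s in zip(estimates, std_errors)]


# Values quoted in the Toluca Company example.
B = 2.069                      # t(.975; 23), i.e. t(1 - .10/4; n - 2) with n = 25
b = [62.37, 3.5702]            # b0, b1
s = [26.18, 0.3470]            # s{b0}, s{b1}

(lo0, hi0), (lo1, hi1) = bonferroni_joint_limits(b, s, B)
print(f"{lo0:.2f} <= beta0 <= {hi0:.1f}")
print(f"{lo1:.2f} <= beta1 <= {hi1:.2f}")
```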

Comments 

1. We reiterate that the Bonferroni $1 - \alpha$ family confidence coefficient is actually a lower bound
on the true (but often unknown) family confidence coefficient. To the extent that incorrect interval
estimates of $\beta_0$ and $\beta_1$ tend to pair up in the family, the families of statements will tend to be correct
more than $(1 - \alpha)100$ percent of the time. Because of this conservative nature of the Bonferroni
procedure, family confidence coefficients are frequently specified at lower levels (e.g., 90 percent)
than when a single estimate is made.

2. The Bonferroni inequality (4.2a) can easily be extended to $g$ simultaneous confidence intervals
with family confidence coefficient $1 - \alpha$:

$$P\left(\bigcap_{i=1}^{g} \bar{A}_i\right) \ge 1 - g\alpha \tag{4.4}$$

Thus, if $g$ interval estimates are desired with family confidence coefficient $1 - \alpha$, constructing each
interval estimate with statement confidence coefficient $1 - \alpha/g$ will suffice.

3. For a given family confidence coefficient, the larger the number of confidence intervals in the 
family, the greater becomes the multiple B, which may make some or all of the confidence intervals 
too wide to be helpful. The Bonferroni technique is ordinarily most useful when the number of
simultaneous estimates is not too large. 

4. It is not necessary with the Bonferroni procedure that the confidence intervals have the same 
statement confidence coefficient. Different statement confidence coefficients, depending on the
importance of each estimate, can be used. For instance, in our earlier illustration $\beta_0$ might be estimated with
a 92 percent confidence interval and $\beta_1$ with a 98 percent confidence interval. The family confidence
coefficient by (4.2) will still be at least 90 percent.

5. Joint confidence intervals can be used directly for testing. To illustrate this use, an industrial 
engineer working for the Toluca Company theorized that the regression function should have an
intercept of 30.0 and a slope of 2.50. Although 30.0 falls in the confidence interval for $\beta_0$, 2.50 does
not fall in the confidence interval for $\beta_1$. Thus, the engineer's theoretical expectations are not correct
at the $\alpha = .10$ family level of significance.

6. The estimators $b_0$ and $b_1$ are usually correlated, but the Bonferroni simultaneous confidence
limits in (4.3) only recognize this correlation by means of the bound on the family confidence coefficient.
It can be shown that the covariance between $b_0$ and $b_1$ is:

$$\sigma\{b_0, b_1\} = -\bar{X} \sigma^2\{b_1\} \tag{4.5}$$

Note that if $\bar{X}$ is positive, $b_0$ and $b_1$ are negatively correlated, implying that if the estimate $b_1$ is too
high, the estimate $b_0$ is likely to be too low, and vice versa.

In the Toluca Company example, $\bar{X} = 70.00$; hence the covariance between $b_0$ and $b_1$ is negative.
This implies that the estimators $b_0$ and $b_1$ here tend to err in opposite directions. We expect this
intuitively. Since the observed points $(X_i, Y_i)$ fall in the first quadrant (see Figure 1.10a), we anticipate
that if the slope of the fitted regression line is too steep ($b_1$ overestimates $\beta_1$), the intercept is most
likely to be too low ($b_0$ underestimates $\beta_0$), and vice versa.

When the independent variable is $X_i - \bar{X}$, as in the alternative model (1.6), $b_0^*$ and $b_1$ are
uncorrelated according to (4.5) because the mean of the $X_i - \bar{X}$ observations is zero. ■
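The bookkeeping behind Comments 2 and 4 is simple enough to sketch in code. The two helper functions below are hypothetical names introduced for illustration; they only evaluate the Bonferroni lower bound for a set of statement error rates and the t percentile needed when $g$ two-sided intervals share a family error rate equally.

```python
def bonferroni_family_bound(statement_alphas):
    """Lower bound 1 - sum(alpha_i) on the family confidence coefficient."""
    return 1.0 - sum(statement_alphas)


def statement_percentile(family_alpha, g):
    """Percentile of the t distribution for g two-sided intervals: 1 - alpha/(2g)."""
    return 1.0 - family_alpha / (2 * g)


# Two 95 percent intervals, as in (4.2a).
bound_equal = bonferroni_family_bound([0.05, 0.05])
# Unequal statement coefficients from Comment 4: 92 and 98 percent.
bound_unequal = bonferroni_family_bound([0.08, 0.02])
# Percentile needed for g = 2 intervals with family alpha = .10, as in (4.3a).
percentile = statement_percentile(0.10, 2)

print(round(bound_equal, 4), round(bound_unequal, 4), round(percentile, 4))
```

Both sets of statement coefficients give the same lower bound of .90, which is why unequal allocations are permitted.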


4.2 Simultaneous Estimation of Mean Responses 


Often the mean responses at a number of $X$ levels need to be estimated from the same
sample data. The Toluca Company, for instance, needed to estimate the mean number
of work hours for lots of 30, 65, and 100 units in its search for the optimum lot size. We
already know how to estimate the mean response for any one level of $X$ with given statement
confidence coefficient. Now we shall discuss two procedures for simultaneous estimation
of a number of different mean responses with a family confidence coefficient, so that there
is a known assurance of all of the estimates of mean responses being correct. These are the
Working-Hotelling and the Bonferroni procedures.

The reason why a family confidence coefficient is needed for estimating several mean
responses even though all estimates are based on the same fitted regression line is that
the separate interval estimates of $E\{Y_h\}$ at the different $X_h$ levels need not all be correct
or all be incorrect. The combination of sampling errors in $b_0$ and $b_1$ may be such that
the interval estimates of $E\{Y_h\}$ will be correct over some range of $X$ levels and incorrect
elsewhere.

Working-Hotelling Procedure 

The Working-Hotelling procedure is based on the confidence band for the regression line
discussed in Section 2.6. The confidence band in (2.40) contains the entire regression line and
therefore contains the mean responses at all $X$ levels. Hence, we can use the boundary values
of the confidence band at selected $X$ levels as simultaneous estimates of the mean responses
at these $X$ levels. The family confidence coefficient for these simultaneous estimates will
be at least $1 - \alpha$ because the confidence coefficient that the entire confidence band for the
regression line is correct is $1 - \alpha$.

The Working-Hotelling procedure for obtaining simultaneous confidence intervals for 
the mean responses at selected X levels is therefore simply to use the boundary values in 
(2.40) for the X levels of interest. The simultaneous confidence limits for g mean responses 
$E\{Y_h\}$ for regression model (2.1) with the Working-Hotelling procedure therefore are:

$$\hat{Y}_h \pm W s\{\hat{Y}_h\} \tag{4.6}$$

where:

$$W^2 = 2F(1 - \alpha;\, 2,\, n - 2) \tag{4.6a}$$

and $\hat{Y}_h$ and $s\{\hat{Y}_h\}$ are defined in (2.28) and (2.30), respectively.

Example

For the Toluca Company example, we require a family of estimates of the mean number
of work hours at the following lot size levels: $X_h = 30, 65, 100$. The family confidence
coefficient is to be .90. In Chapter 2 we obtained $\hat{Y}_h$ and $s\{\hat{Y}_h\}$ for $X_h = 65$ and 100. In
similar fashion, we can obtain the needed results for lot size $X_h = 30$. We summarize the
results here:

  $X_h$    $\hat{Y}_h$    $s\{\hat{Y}_h\}$
    30       169.5          16.97
    65       294.4           9.918
   100       419.4          14.27

For a family confidence coefficient of .90, we require $F(.90;\, 2,\, 23) = 2.549$. Hence:

$$W^2 = 2(2.549) = 5.098 \qquad W = 2.258$$

We can now obtain the confidence intervals for the mean number of work hours at $X_h = 30$,
65, and 100:

$$131.2 = 169.5 - 2.258(16.97) \le E\{Y_h\} \le 169.5 + 2.258(16.97) = 207.8$$
$$272.0 = 294.4 - 2.258(9.918) \le E\{Y_h\} \le 294.4 + 2.258(9.918) = 316.8$$
$$387.2 = 419.4 - 2.258(14.27) \le E\{Y_h\} \le 419.4 + 2.258(14.27) = 451.6$$

With family confidence coefficient .90, we conclude that the mean number of work hours
required is between 131.2 and 207.8 for lots of 30 parts, between 272.0 and 316.8 for lots
of 65 parts, and between 387.2 and 451.6 for lots of 100 parts. The family confidence
coefficient .90 provides assurance that the procedure leads to all correct estimates in the
family of estimates.
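The Working-Hotelling calculation for the Toluca Company example can be reproduced as follows. This is a sketch under stated assumptions: the F percentile 2.549 is taken from the text rather than computed, and the function names are illustrative only. Because $W$ is not rounded to 2.258 here, the limits may differ from the printed ones in the last digit shown.

```python
import math


def working_hotelling_multiple(f_percentile):
    """W from (4.6a): W^2 = 2 F(1 - alpha; 2, n - 2)."""
    return math.sqrt(2.0 * f_percentile)


def simultaneous_limits(y_hat, s_y_hat, multiple):
    """Limits y_hat +/- multiple * s{y_hat} for one X level."""
    half_width = multiple * s_y_hat
    return y_hat - half_width, y_hat + half_width


W = working_hotelling_multiple(2.549)   # F(.90; 2, 23) as quoted in the text
# (X_h, fitted mean, estimated standard deviation) from the summary table.
rows = [(30, 169.5, 16.97), (65, 294.4, 9.918), (100, 419.4, 14.27)]
wh_limits = {x_h: simultaneous_limits(y_hat, s, W) for x_h, y_hat, s in rows}
for x_h, (lo, hi) in wh_limits.items():
    print(f"X_h = {x_h:3d}: {lo:.1f} <= E{{Y_h}} <= {hi:.1f}")
```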

Bonferroni Procedure 

The Bonferroni procedure, discussed earlier for simultaneous estimation of $\beta_0$ and $\beta_1$, is
a completely general procedure. To construct a family of confidence intervals for mean
responses at different $X$ levels with this procedure, we calculate in each instance the usual
confidence limits for a single mean response $E\{Y_h\}$ in (2.33), adjusting the statement
confidence coefficient to yield the specified family confidence coefficient.

When $E\{Y_h\}$ is to be estimated for $g$ levels $X_h$ with family confidence coefficient $1 - \alpha$,
the Bonferroni confidence limits for regression model (2.1) are:

$$\hat{Y}_h \pm B s\{\hat{Y}_h\} \tag{4.7}$$

where:

$$B = t(1 - \alpha/2g;\, n - 2) \tag{4.7a}$$

and $g$ is the number of confidence intervals in the family.

Example

For the Toluca Company example, the Bonferroni simultaneous estimates of the mean
number of work hours for lot sizes $X_h = 30$, 65, and 100 with family confidence coefficient
.90 require the same data as with the Working-Hotelling procedure. In addition, we require
$B = t[1 - .10/2(3);\, 23] = t(.9833;\, 23) = 2.263$.

We thus obtain the following confidence intervals, with 90 percent family confidence
coefficient, for the mean number of work hours for lot sizes $X_h = 30$, 65, and 100:

$$131.1 = 169.5 - 2.263(16.97) \le E\{Y_h\} \le 169.5 + 2.263(16.97) = 207.9$$
$$272.0 = 294.4 - 2.263(9.918) \le E\{Y_h\} \le 294.4 + 2.263(9.918) = 316.8$$
$$387.1 = 419.4 - 2.263(14.27) \le E\{Y_h\} \le 419.4 + 2.263(14.27) = 451.7$$

Comments

1. In this instance the Working-Hotelling confidence limits are slightly tighter than, or the same
as, the Bonferroni limits. In other cases where the number of statements is small, the Bonferroni
limits may be tighter. For larger families, the Working-Hotelling confidence limits will always be
the tighter, since $W$ in (4.6a) stays the same for any number of statements in the family whereas $B$
in (4.7a) becomes larger as the number of statements increases. In practice, once the family
confidence coefficient has been decided upon, one can calculate the $W$ and $B$ multiples to determine which
procedure leads to tighter confidence limits.

2. Both the Working-Hotelling and Bonferroni procedures provide lower bounds to the actual
family confidence coefficient.

3. The levels of the predictor variable for which the mean response is to be estimated are sometimes
not known in advance. Instead, the levels of interest are determined as the analysis proceeds. This was
the case in the Toluca Company example, where the lot size levels of interest were determined after
analyses relating to other factors affecting the optimum lot size were completed. In such cases, it is
better to use the Working-Hotelling procedure because the family for this procedure encompasses all
possible levels of $X$. ■
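Choosing between the two procedures for mean responses amounts to comparing the multiples $W$ and $B$. The sketch below uses the values quoted for the Toluca Company example ($W = 2.258$, $B = 2.263$ for $g = 3$ statements) and a hypothetical helper, `tighter_procedure`, to pick the procedure with the smaller multiple; it also recomputes the Bonferroni limits.

```python
def tighter_procedure(w_multiple, b_multiple):
    """Name the procedure with the smaller multiple (narrower limits)."""
    if w_multiple < b_multiple:
        return "Working-Hotelling"
    if b_multiple < w_multiple:
        return "Bonferroni"
    return "tie"


W = 2.258   # from W^2 = 2 F(.90; 2, 23) = 2(2.549)
B = 2.263   # t(.9833; 23) for g = 3 statements

# (X_h, fitted mean, estimated standard deviation) for the three lot sizes.
rows = [(30, 169.5, 16.97), (65, 294.4, 9.918), (100, 419.4, 14.27)]
bonferroni_limits = {x_h: (y_hat - B * s, y_hat + B * s) for x_h, y_hat, s in rows}

print(tighter_procedure(W, B))
for x_h, (lo, hi) in bonferroni_limits.items():
    print(f"X_h = {x_h:3d}: {lo:.1f} <= E{{Y_h}} <= {hi:.1f}")
```

The difference between the two multiples is small here, so the two sets of limits agree to within about a tenth of a work hour.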


4.3 Simultaneous Prediction Intervals for New Observations 


Now we consider the simultaneous predictions of $g$ new observations on $Y$ in $g$ independent
trials at $g$ different levels of $X$. Simultaneous prediction intervals are frequently of
interest. For instance, a company may wish to predict sales in each of its sales regions from 
a regression relation between region sales and population size in the region. 

Two procedures for making simultaneous predictions will be considered here: the Scheffe 
and Bonferroni procedures. Both utilize the same type of limits as those for predicting a 
single observation in (2.36), and only the multiple of the estimated standard deviation is 
changed. The Scheffe procedure uses the F distribution, whereas the Bonferroni procedure
uses the t distribution. The simultaneous prediction limits for $g$ predictions with the Scheffe
procedure with family confidence coefficient $1 - \alpha$ are:

$$\hat{Y}_h \pm S s\{\text{pred}\} \tag{4.8}$$

where:

$$S^2 = gF(1 - \alpha;\, g,\, n - 2) \tag{4.8a}$$

and $s\{\text{pred}\}$ is defined in (2.38). With the Bonferroni procedure, the $1 - \alpha$ simultaneous
prediction limits are:

$$\hat{Y}_h \pm B s\{\text{pred}\} \tag{4.9}$$

where:

$$B = t(1 - \alpha/2g;\, n - 2) \tag{4.9a}$$

The $S$ and $B$ multiples can be evaluated in advance to see which procedure provides tighter
prediction limits.

Example

The Toluca Company wishes to predict the work hours required for each of the next two
lots, which will consist of 80 and 100 units. The family confidence coefficient is to be
95 percent. To determine which procedure will give tighter prediction limits, we obtain the
$S$ and $B$ multiples:

$$S^2 = 2F(.95;\, 2,\, 23) = 2(3.422) = 6.844 \qquad S = 2.616$$
$$B = t[1 - .05/2(2);\, 23] = t(.9875;\, 23) = 2.398$$

We see that the Bonferroni procedure will yield somewhat tighter prediction limits. The
needed estimates, based on earlier results, are (calculations not shown):

  $X_h$    $\hat{Y}_h$    $s\{\text{pred}\}$    $Bs\{\text{pred}\}$
    80       348.0           49.91                 119.7
   100       419.4           50.87                 122.0

The simultaneous prediction limits for the next two lots, with family confidence coefficient
.95, when $X_h = 80$ and 100 then are:

$$228.3 = 348.0 - 119.7 \le Y_{h(\text{new})} \le 348.0 + 119.7 = 467.7$$
$$297.4 = 419.4 - 122.0 \le Y_{h(\text{new})} \le 419.4 + 122.0 = 541.4$$
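The comparison of the Scheffe and Bonferroni multiples, and the resulting prediction limits, can be sketched as below. The F and t percentiles are entered as quoted in the text (3.422 and 2.398), not computed, and the variable names are illustrative.

```python
import math

# Multiples for g = 2 predictions with family confidence coefficient .95.
g = 2
S = math.sqrt(g * 3.422)   # S^2 = g F(.95; 2, 23); F value quoted in the text
B = 2.398                  # t(.9875; 23), quoted in the text
multiple = min(S, B)       # use whichever procedure gives tighter limits

# (X_h, fitted value, s{pred}) for the two lots.
rows = [(80, 348.0, 49.91), (100, 419.4, 50.87)]
prediction_limits = {
    x_h: (y_hat - multiple * s_pred, y_hat + multiple * s_pred)
    for x_h, y_hat, s_pred in rows
}
for x_h, (lo, hi) in prediction_limits.items():
    print(f"X_h = {x_h:3d}: {lo:.1f} <= Y_h(new) <= {hi:.1f}")
```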







With family confidence coefficient at least .95, we can predict that the work hours for the 
next two production runs will be within the above pair of limits. As we noted in Chapter 2, the
prediction limits are very wide and may not be too useful for planning worker requirements. 

Comments 

1. Simultaneous prediction intervals for $g$ new observations on $Y$ at $g$ different levels of $X$ with
a $1 - \alpha$ family confidence coefficient are wider than the corresponding single prediction intervals
of (2.36). When the number of simultaneous predictions is not large, however, the difference in the
width is only moderate. For instance, a single 95 percent prediction interval for the Toluca Company
example would utilize a t multiple of $t(.975;\, 23) = 2.069$, which is only moderately smaller than the
multiple $B = 2.398$ for two simultaneous predictions.

2. Note that both the $B$ and $S$ multiples for simultaneous predictions become larger as $g$ increases.
This contrasts with simultaneous estimation of mean responses where the $B$ multiple becomes larger
but not the $W$ multiple. When $g$ is large, both the $B$ and $S$ multiples for simultaneous predictions
may become so large that the prediction intervals will be too wide to be useful. Other simultaneous
estimation techniques might then be considered, as discussed in Reference 4.1. ■

4.4 Regression through Origin

Sometimes the regression function is known to be linear and to go through the origin at 
(0, 0). This may occur, for instance, when X is units of output and Y is variable cost, so Y 
is zero by definition when X is zero. Another example is where X is the number of brands 
of beer stocked in a supermarket in an experiment (including some supermarkets with no 
brands stocked) and Y is the volume of beer sales in the supermarket. 


Model

The normal error model for these cases is the same as regression model (2.1) except that
$\beta_0 = 0$:

$$Y_i = \beta_1 X_i + \varepsilon_i \tag{4.10}$$

where:

$\beta_1$ is a parameter
$X_i$ are known constants
$\varepsilon_i$ are independent $N(0, \sigma^2)$

The regression function for model (4.10) is:

$$E\{Y\} = \beta_1 X \tag{4.11}$$

which is a straight line through the origin, with slope $\beta_1$.

Inferences

The least squares estimator of $\beta_1$ in regression model (4.10) is obtained by minimizing:

$$Q = \sum (Y_i - \beta_1 X_i)^2 \tag{4.12}$$

with respect to $\beta_1$. The resulting normal equation is:

$$\sum X_i (Y_i - b_1 X_i) = 0 \tag{4.13}$$

leading to the point estimator:

$$b_1 = \frac{\sum X_i Y_i}{\sum X_i^2} \tag{4.14}$$

The estimator $b_1$ in (4.14) is also the maximum likelihood estimator for the normal error
regression model (4.10).

The fitted value for the $i$th case is:

$$\hat{Y}_i = b_1 X_i \tag{4.15}$$

and the $i$th residual is defined, as usual, as the difference between the observed and fitted
values:

$$e_i = Y_i - \hat{Y}_i = Y_i - b_1 X_i \tag{4.16}$$

An unbiased estimator of the error variance $\sigma^2$ for regression model (4.10) is:

$$s^2 = MSE = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1} = \frac{\sum e_i^2}{n - 1} \tag{4.17}$$

The reason for the denominator $n - 1$ is that only one degree of freedom is lost in estimating
the single parameter in the regression function (4.11).
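The estimator (4.14), the residuals (4.16), and MSE (4.17) translate directly into code. The following is a minimal sketch; the function name `fit_through_origin` and the five data points are hypothetical and serve only to show the mechanics.

```python
def fit_through_origin(x, y):
    """Least squares fit of Y = beta1 * X; returns b1, residuals, and MSE."""
    b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)  # (4.14)
    residuals = [yi - b1 * xi for xi, yi in zip(x, y)]                    # (4.16)
    mse = sum(e ** 2 for e in residuals) / (len(x) - 1)                   # (4.17)
    return b1, residuals, mse


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, residuals, mse = fit_through_origin(x, y)
weighted_sum = sum(xi * e for xi, e in zip(x, residuals))
print(f"b1 = {b1:.4f}, MSE = {mse:.4f}")
print(f"sum of X_i * e_i = {weighted_sum:.2e}")   # zero up to rounding, by (4.13)
print(f"sum of e_i = {sum(residuals):.4f}")       # need not be zero
```

The normal equation (4.13) forces the $X$-weighted sum of the residuals to zero, but places no such constraint on their unweighted sum.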

Confidence limits for $\beta_1$, $E\{Y_h\}$, and a new observation $Y_{h(\text{new})}$ for regression model (4.10)
are shown in Table 4.1. Note that the t multiple has $n - 1$ degrees of freedom here, the
degrees of freedom associated with MSE. The results in Table 4.1 are derived in analogous
fashion to the earlier results for regression model (2.1). Whereas for model (2.1) with an
intercept we encounter terms $(X_i - \bar{X})^2$ or $(X_h - \bar{X})^2$, here we find $X_i^2$ and $X_h^2$ because of
the regression through the origin.


Example

The Charles Plumbing Supplies Company operates 12 warehouses. In an attempt to tighten
procedures for planning and control, a consultant studied the relation between number of
work units performed ($X$) and total variable labor cost ($Y$) in the warehouses during a test
period. A portion of the data is given in Table 4.2, columns 1 and 2, and the observations
are shown as a scatter plot in Figure 4.1.


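TABLE 4.1 Confidence Limits for Regression through Origin.

Estimate of              Estimated Variance                                                      Confidence Limits
$\beta_1$                $s^2\{b_1\} = \dfrac{MSE}{\sum X_i^2}$                                  $b_1 \pm t s\{b_1\}$   (4.18)
$E\{Y_h\}$               $s^2\{\hat{Y}_h\} = \dfrac{X_h^2 MSE}{\sum X_i^2}$                      $\hat{Y}_h \pm t s\{\hat{Y}_h\}$   (4.19)
$Y_{h(\text{new})}$      $s^2\{\text{pred}\} = MSE\left(1 + \dfrac{X_h^2}{\sum X_i^2}\right)$    $\hat{Y}_h \pm t s\{\text{pred}\}$   (4.20)

where: $t = t(1 - \alpha/2;\, n - 1)$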



TABLE 4.2 Regression through Origin - Warehouse Example.

                (1)          (2)
                Work         Variable
                Units        Labor Cost
  Warehouse     Performed    (dollars)       (3)          (4)         (5)            (6)
     $i$         $X_i$        $Y_i$        $X_i Y_i$     $X_i^2$    $\hat{Y}_i$     $e_i$

      1            20          114           2,280          400        93.71        20.29
      2           196          921         180,516       38,416       918.31         2.69
      3           115          560          64,400       13,225       538.81        21.19
     ...          ...          ...             ...          ...          ...          ...
     10           147          670          98,490       21,609       688.74       -18.74
     11           182          828         150,696       33,124       852.72       -24.72
     12           160          762         121,920       25,600       749.64        12.36

   Total        1,359        6,390         894,714      190,963     6,367.28        22.72


FIGURE 4.1 Scatter Plot and Fitted Regression through Origin - Warehouse Example.
Model (4.10) for regression through the origin was employed since $Y$ involves variable
costs only and the other conditions of the model appeared to be satisfied as well. From
Table 4.2, columns 3 and 4, we have $\sum X_i Y_i = 894{,}714$ and $\sum X_i^2 = 190{,}963$. Hence:

$$b_1 = \frac{\sum X_i Y_i}{\sum X_i^2} = \frac{894{,}714}{190{,}963} = 4.68527$$

and the estimated regression function is:

$$\hat{Y} = 4.68527X$$

In Table 4.2, the fitted values are shown in column 5, the residuals in column 6. The fitted
regression line is plotted in Figure 4.1 and it appears to be a good fit.

An interval estimate of is desired with a 95 percent confidence coefficient. By squaring 
the residuals in Table 4.2, column 6, afid then summing them, we obtain (calculations not 
shown): 

Y^ef 2,457.6 


s 


MSE 


n 


1 


11 


223.42 




164 Part One Simple Linear Regression 


From Table 4.2, column 4, we have $\sum X_i^2 = 190{,}963$. Hence:

    $s^2\{b_1\} = \dfrac{MSE}{\sum X_i^2} = \dfrac{223.42}{190{,}963} = .0011700 \qquad s\{b_1\} = .034205$


For a 95 percent confidence coefficient, we require $t(.975; 11) = 2.201$. The confidence
limits, by (4.18) in Table 4.1, are $4.68527 \pm 2.201(.034205)$. The 95 percent confidence
interval for $\beta_1$ therefore is:

    $4.61 \le \beta_1 \le 4.76$

Thus, with 95 percent confidence, it is estimated that the mean variable labor cost increases
by somewhere between $4.61 and $4.76 for each additional work unit performed.
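
The arithmetic of the warehouse example can be retraced with a short Python sketch. It uses only the totals quoted in the text; the variable names are illustrative, and the t value is copied from the text rather than computed.

```python
import math

# Totals quoted in the warehouse example (Table 4.2 and the text).
n = 12
sum_xy = 894_714.0   # sum of X_i * Y_i
sum_xx = 190_963.0   # sum of X_i ** 2
sse = 2_457.6        # sum of squared residuals
t_crit = 2.201       # t(.975; 11), taken from the text

b1 = sum_xy / sum_xx             # slope estimate for regression through the origin
mse = sse / (n - 1)              # one parameter is estimated, so n - 1 degrees of freedom
s_b1 = math.sqrt(mse / sum_xx)   # estimated standard deviation of b1, as in (4.18)
lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1

print(f"b1 = {b1:.5f}, MSE = {mse:.2f}, s(b1) = {s_b1:.6f}")
print(f"95 percent limits: {lower:.2f} to {upper:.2f}")
```

Because the error degrees of freedom are n - 1 rather than n - 2, the t multiple here is based on 11 degrees of freedom.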


Important Cautions for Using Regression through Origin 

In using regression-through-the-origin model (4.10), the residuals must be interpreted with
care because they usually do not sum to zero, as may be seen in Table 4.2, column 6, for
the warehouse example. Note from the normal equation (4.13) that the only constraint on
the residuals is of the form $\sum X_i e_i = 0$. Thus, in a residual plot the residuals will usually
not be balanced around the zero line.

Another important caution for regression through the origin is that the sum of the squared
residuals $SSE = \sum e_i^2$ for this type of regression may exceed the total sum of squares
$SSTO = \sum (Y_i - \bar{Y})^2$. This can occur when the data form a curvilinear pattern or a linear
pattern with an intercept away from the origin. Hence, the coefficient of determination
in (2.72), $R^2 = 1 - SSE/SSTO$, may turn out to be negative. Consequently, the coefficient
of determination $R^2$ has no clear meaning for regression through the origin.
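
Both cautions can be seen in a small numerical sketch. The five data points below are hypothetical (they are not from the text) and were chosen to lie along a line whose intercept is far from the origin.

```python
# Hypothetical data: roughly linear with a large intercept and a mild negative slope.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 9.5, 9.2, 8.9, 8.4]

# Least squares slope for regression through the origin.
b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
residuals = [yi - b1 * xi for xi, yi in zip(x, y)]

y_bar = sum(y) / len(y)
sse = sum(e * e for e in residuals)
ssto = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1.0 - sse / ssto                                    # formula (2.72)

sum_residuals = sum(residuals)                                  # not constrained to be zero
sum_x_residuals = sum(xi * e for xi, e in zip(x, residuals))    # zero by the normal equation

print(f"b1 = {b1:.3f}, sum e = {sum_residuals:.3f}, sum X e = {sum_x_residuals:.2e}")
print(f"SSE = {sse:.2f}, SSTO = {ssto:.2f}, R^2 = {r_squared:.2f}")
```

Here SSE is far larger than SSTO, so the value computed from (2.72) is negative, while the residuals sum to a clearly nonzero amount.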

Like any other statistical model, regression-through-the-origin model (4.10) needs to be 
evaluated for aptness. Even when it is known that the regression function must go through 
the origin, the function may not be linear or the variance of the error terms may not be 
constant. In many other cases, one cannot be sure in advance that the regression line goes
through the origin. Hence, it is generally a safe practice not to use regression-through-the-origin
model (4.10) and instead use the intercept regression model (2.1). If the regression
line does go through the origin, $b_0$ with the intercept model will differ from 0 only by a
small sampling error, and unless the sample size is very small use of the intercept regression 
model (2.1) has no disadvantages of any consequence. If the regression line does not go 
through the origin, use of the intercept regression model (2.1) will avoid potentially serious 
difficulties resulting from forcing the regression line through the origin when this is not 
appropriate. 

Comments 

1. In interval estimation of $E\{Y_h\}$ or prediction of $Y_{h(new)}$ with regression through the origin, note
that the intervals (4.19) and (4.20) in Table 4.1 widen the further $X_h$ is from the origin. The reason
is that the value of the true regression function is known precisely at the origin, so the effect of the
sampling error in the slope $b_1$ becomes increasingly important the farther $X_h$ is from the origin.

2. Since with regression through the origin only one parameter, $\beta_1$, must be estimated for regression
function (4.11), simultaneous estimation methods are not required to make a family of statements
about several mean responses. For a given confidence coefficient $1 - \alpha$, formula (4.19) in Table 4.1
can be used repeatedly with the given sample results for different levels of X to generate a family of
statements for which the family confidence coefficient is still $1 - \alpha$.

3. Some statistical packages calculate $R^2$ for regression through the origin according to (2.72)
and hence will sometimes show a negative value for $R^2$. Other statistical packages calculate $R^2$ using
the total uncorrected sum of squares SSTOU in (2.54). This procedure avoids obtaining a negative
coefficient but lacks any meaningful interpretation.

4. The ANOVA tables for regression through the origin shown in the output for many statistical
packages are based on $SSTOU = \sum Y_i^2$, $SSRU = \sum \hat{Y}_i^2 = b_1^2 \sum X_i^2$, and $SSE = \sum (Y_i - b_1 X_i)^2$,
where SSRU stands for the uncorrected regression sum of squares. It can be shown that these sums of
squares are additive: SSTOU = SSRU + SSE.
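
The additivity of the uncorrected sums of squares can be checked numerically. The sketch below uses only the first three warehouses listed in Table 4.2 as illustrative data, so the slope it computes differs from the full-sample value in the text.

```python
import math

# First three rows of Table 4.2, used only as a small illustrative data set.
x = [20.0, 196.0, 115.0]
y = [114.0, 921.0, 560.0]

sum_xx = sum(xi * xi for xi in x)
b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum_xx   # slope through the origin

sstou = sum(yi * yi for yi in y)                            # uncorrected total sum of squares
ssru = b1 ** 2 * sum_xx                                     # uncorrected regression sum of squares
sse = sum((yi - b1 * xi) ** 2 for xi, yi in zip(x, y))      # error sum of squares

print(f"SSTOU = {sstou:.2f}, SSRU + SSE = {ssru + sse:.2f}")
```

The identity holds because the cross-product term vanishes when $\sum X_i e_i = 0$.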


4.5 Effects of Measurement Errors

In our discussion of regression models up to this point, we have not explicitly considered
the presence of measurement errors in the observations on either the response variable Y
or the predictor variable X. We now examine briefly the effects of measurement errors in
the observations on the response and predictor variables.


Measurement Errors in Y 

When random measurement errors are present in the observations on the response variable
Y, no new problems are created when these errors are uncorrelated and not biased (positive
and negative measurement errors tend to cancel out). Consider, for example, a study of
the relation between the time required to complete a task (Y) and the complexity of the
task (X). The time to complete the task may not be measured accurately because the person
operating the stopwatch may not do so at the precise instants called for. As long as such
measurement errors are of a random nature, uncorrelated, and not biased, these measurement
errors are simply absorbed in the model error terms. The model error term always reflects
the composite effects of a large number of factors not considered in the model, one of which
now would be the random variation due to inaccuracy in the process of measuring Y.

Measurement Errors in X 

Unfortunately, a different situation holds when the observations on the predictor variable 
X are subject to measurement errors. Frequently, to be sure, the observations on X are 
accurate, with no measurement errors, as when the predictor variable is the price of a product 
in different stores, the number of variables in different optimization problems, or the wage 
rate for different classes of employees. At other times, however, measurement errors may 
enter the value observed for the predictor variable, for instance, when the predictor variable
is pressure in a tank, temperature in an oven, speed of a production line, or reported age of
a person.

We shall use the last illustration in our development of the nature of the problem. Suppose 
we are interested in the relation between employees’ piecework earnings and their ages. 
Let $X_i$ denote the true age of the ith employee and $X_i^*$ the age reported by the employee
on the employment record. Needless to say, the two are not always the same. We define the
measurement error $\delta_i$ as follows:

    $\delta_i = X_i^* - X_i$    (4.21)

The regression model we would like to study is:

    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$    (4.22)

However, we observe only $X_i^*$, so we must replace the true age $X_i$ in (4.22) by the reported
age $X_i^*$, using (4.21):

    $Y_i = \beta_0 + \beta_1 (X_i^* - \delta_i) + \varepsilon_i$    (4.23)

We can now rewrite (4.23) as follows:

    $Y_i = \beta_0 + \beta_1 X_i^* + (\varepsilon_i - \beta_1 \delta_i)$    (4.24)

Model (4.24) may appear like an ordinary regression model, with predictor variable $X^*$
and error term $\varepsilon - \beta_1 \delta$, but it is not. The predictor variable observation $X_i^*$ is a random
variable, which, as we shall see, is correlated with the error term $\varepsilon_i - \beta_1 \delta_i$.

Intuitively, we know that $\varepsilon_i - \beta_1 \delta_i$ is not independent of $X_i^*$ since (4.21) constrains
$X_i^* - \delta_i$ to equal $X_i$. To determine the dependence formally, let us assume the following
simple conditions:

    $E\{\delta_i\} = 0$    (4.25a)

    $E\{\varepsilon_i\} = 0$    (4.25b)

    $E\{\delta_i \varepsilon_i\} = 0$    (4.25c)

Note that condition (4.25a) implies that $E\{X_i^*\} = E\{X_i + \delta_i\} = X_i$, so that in our example
the reported ages would be unbiased estimates of the true ages. Condition (4.25b) is the usual
requirement that the model error terms $\varepsilon_i$ have expectation 0, balancing around the regression
line. Finally, condition (4.25c) requires that the measurement error $\delta_i$ not be correlated with
the model error $\varepsilon_i$; this follows because, by (A.21a), $\sigma\{\delta_i, \varepsilon_i\} = E\{\delta_i \varepsilon_i\}$ since
$E\{\delta_i\} = E\{\varepsilon_i\} = 0$ by (4.25a) and (4.25b).

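We now wish to find the covariance between the observations $X_i^*$ and the random terms
$\varepsilon_i - \beta_1 \delta_i$ in model (4.24) under the conditions in (4.25), which imply that $E\{X_i^*\} = X_i$ and
$E\{\varepsilon_i - \beta_1 \delta_i\} = 0$:

    $\sigma\{X_i^*, \varepsilon_i - \beta_1 \delta_i\} = E\{[X_i^* - E\{X_i^*\}][(\varepsilon_i - \beta_1 \delta_i) - E\{\varepsilon_i - \beta_1 \delta_i\}]\}$
                                       $= E\{(X_i^* - X_i)(\varepsilon_i - \beta_1 \delta_i)\}$
                                       $= E\{\delta_i (\varepsilon_i - \beta_1 \delta_i)\}$
                                       $= E\{\delta_i \varepsilon_i - \beta_1 \delta_i^2\}$

Now $E\{\delta_i \varepsilon_i\} = 0$ by (4.25c), and $E\{\delta_i^2\} = \sigma^2\{\delta_i\}$ by (A.15a) because $E\{\delta_i\} = 0$ by (4.25a).
We therefore obtain:

    $\sigma\{X_i^*, \varepsilon_i - \beta_1 \delta_i\} = -\beta_1 \sigma^2\{\delta_i\}$    (4.26)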

This covariance is not zero whenever there is a linear regression relation between X and Y. 

If we assume that the response Y and the random predictor variable $X^*$ follow a bivariate
normal distribution, then the conditional distributions of the $Y_i$, $i = 1, \ldots, n$, given $X_i^*$,
$i = 1, \ldots, n$, are normal and independent, with conditional mean $E\{Y_i \mid X_i^*\} = \beta_0^* + \beta_1^* X_i^*$
and conditional variance $\sigma^2_{Y|X^*}$. Furthermore, it can be shown that $\beta_1^* = \beta_1[\sigma_X^2/(\sigma_X^2 + \sigma_\delta^2)]$,
where $\sigma_X^2$ is the variance of X and $\sigma_\delta^2$ is the variance of the measurement error $\delta$. Hence, the least squares slope
estimate from fitting Y on $X^*$ is not an estimate of $\beta_1$, but is an estimate of $\beta_1^*$.
The resulting estimated regression coefficient will be too small in magnitude on average, with the
magnitude of the bias dependent upon the relative sizes of $\sigma_X^2$ and $\sigma_\delta^2$. If $\sigma_\delta^2$ is small relative
to $\sigma_X^2$, then the bias would be small; otherwise the bias may be substantial. Discussion
of possible approaches to estimation that rely on estimating these unknown
variances will be found in specialized texts such as Reference 4.2.

Another approach is to use additional variables that are known to be related to the true 
value of X but not to the errors of measurement 8. Such variables are called instrumental 
variables because they are used as an instrument in studying the relation between X and 
Y. Instrumental variables make it possible to obtain consistent estimators of the regression 
parameters. Again, the reader is referred to Reference 4.2.
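
A small simulation can make the downward bias concrete. The sketch below is illustrative only: the parameter values, sample size, and helper `ols_slope` are assumptions chosen for the demonstration, and it relies on the standard attenuation result that the slope from regressing Y on the error-laden predictor settles near $\beta_1$ times var(X) / (var(X) + var(measurement error)).

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope for a model with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(0)                  # fixed seed so the run is repeatable
beta0, beta1 = 2.0, 1.5                 # hypothetical true parameters
sigma_x, sigma_delta, sigma_eps = 2.0, 1.0, 0.5
n = 20_000

x_true = [rng.gauss(10.0, sigma_x) for _ in range(n)]            # true X_i
x_obs = [xi + rng.gauss(0.0, sigma_delta) for xi in x_true]      # X_i* = X_i + delta_i
y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma_eps) for xi in x_true]

slope_true_x = ols_slope(x_true, y)     # close to beta1
slope_obs_x = ols_slope(x_obs, y)       # shrunk toward zero
attenuated = beta1 * sigma_x ** 2 / (sigma_x ** 2 + sigma_delta ** 2)

print(f"slope using true X:     {slope_true_x:.3f}")
print(f"slope using observed X: {slope_obs_x:.3f}")
print(f"attenuated target:      {attenuated:.3f}")
```

With these settings the measurement-error variance is one quarter of the variance of X, so the fitted slope on the observed predictor settles near 80 percent of the true slope.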


Comment 

What, it may be asked, is the distinction between the case when X is a random variable, considered in
Chapter 2, and the case when X is subject to random measurement errors, and why are there special
problems with the latter? When X is a random variable, the observations on X are not under the
control of the analyst and will vary at random from trial to trial, as when X is the number of persons
entering a store in a day. If this random variable X is not subject to measurement errors, however, it
can be accurately ascertained for a given trial. Thus, if there are no measurement errors in counting the
number of persons entering a store in a day, the analyst has accurate information to study the relation
between number of persons entering the store and sales, even though the levels of number of persons
entering the store that actually occur cannot be controlled. If, on the other hand, measurement errors
are present in the observed number of persons entering the store, a distorted picture of the relation
between number of persons and sales will occur because the sales observations will frequently be
matched against an incorrect number of persons.

Berkson Model 

There is one situation where measurement errors in X are no problem. This case was first 
noted by Berkson (Ref. 4.3). Frequently, in an experiment the predictor variable is set at 
a target value. For instance, in an experiment on the effect of room temperature on word 
processor productivity, the temperature may be set at target levels of 68°F, 70°F, and 72°F,
according to the temperature control on the thermostat. The observed temperature $X_i^*$ is
fixed here, whereas the actual temperature $X_i$ is a random variable since the thermostat may
not be completely accurate. Similar situations exist when water pressure is set according to
a gauge, or employees of specified ages according to their employment records are selected
for a study.

In all of these cases, the observation $X_i^*$ is a fixed quantity, whereas the unobserved true
value $X_i$ is a random variable. The measurement error is, as before:

    $\delta_i = X_i^* - X_i$    (4.27)

Here, however, there is no constraint on the relation between $X_i^*$ and $\delta_i$, since $X_i^*$ is a fixed
quantity. Again, we assume that $E\{\delta_i\} = 0$.



Model (4.24), which we obtained when replacing $X_i$ by $X_i^* - \delta_i$, is still applicable for
the Berkson case:

    $Y_i = \beta_0 + \beta_1 X_i^* + (\varepsilon_i - \beta_1 \delta_i)$    (4.28)

The expected value of the error term, $E\{\varepsilon_i - \beta_1 \delta_i\}$, is zero as before under conditions (4.25a)
and (4.25b), since $E\{\varepsilon_i\} = 0$ and $E\{\delta_i\} = 0$. However, $\varepsilon_i - \beta_1 \delta_i$ is now uncorrelated with
$X_i^*$, since $X_i^*$ is a constant for the Berkson case. Hence, the following conditions of an
ordinary regression model are met:

1. The error terms have expectation zero.

2. The predictor variable is a constant, and hence the error terms are not correlated with it.

Thus, least squares procedures can be applied for the Berkson case without modification,
and the estimators $b_0$ and $b_1$ will be unbiased. If we can make the standard normality and
constant variance assumptions for the errors $\varepsilon_i - \beta_1 \delta_i$, the usual tests and interval estimates
can be utilized.
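
The Berkson case can also be illustrated by simulation. In the sketch below the target levels follow the thermostat illustration, while the parameter values, error standard deviations, and the helper `ols_slope` are assumptions made for the demonstration.

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope for a model with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(1)                   # fixed seed so the run is repeatable
beta0, beta1 = 2.0, 1.5                  # hypothetical true parameters
targets = [68.0, 70.0, 72.0]             # fixed target settings X*
per_level = 5_000

x_star = [t for t in targets for _ in range(per_level)]
# From (4.27), delta = X* - X, so the true level is X = X* - delta.
x_true = [xs - rng.gauss(0.0, 1.0) for xs in x_star]
y = [beta0 + beta1 * xt + rng.gauss(0.0, 0.5) for xt in x_true]

slope_on_targets = ols_slope(x_star, y)  # regress Y on the fixed target values
print(f"slope from regressing Y on X*: {slope_on_targets:.3f} (true slope {beta1})")
```

Unlike the case of a randomly mismeasured predictor, the fitted slope here stays close to the true slope; the measurement error only inflates the error variance.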

4.6 Inverse Predictions

At times, a regression model of Y on X is used to make a prediction of the value of X which
gave rise to a new observation Y. This is known as an inverse prediction. We illustrate
inverse predictions by two examples:

1. A trade association analyst has regressed the selling price of a product (Y) on its cost
(X) for the 15 member firms of the association. The selling price $Y_{h(new)}$ for another firm
not belonging to the trade association is known, and it is desired to estimate the cost $X_{h(new)}$
for this firm.

2. A regression analysis of the amount of decrease in cholesterol level (Y) achieved
with a given dosage of a new drug (X) has been conducted, based on observations for
50 patients. A physician is treating a new patient for whom the cholesterol level should
decrease by the amount $Y_{h(new)}$. It is desired to estimate the appropriate dosage level $X_{h(new)}$
to be administered to bring about the needed cholesterol decrease $Y_{h(new)}$.

In inverse predictions, regression model (2.1) is assumed as before:

    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$    (4.29)

The estimated regression function based on n observations is obtained as usual:

    $\hat{Y} = b_0 + b_1 X$    (4.30)

A new observation $Y_{h(new)}$ becomes available, and it is desired to estimate the level $X_{h(new)}$
that gave rise to this new observation. A natural point estimator is obtained by solving (4.30)
for X, given $Y_{h(new)}$:

    $\hat{X}_{h(new)} = \dfrac{Y_{h(new)} - b_0}{b_1} \qquad b_1 \ne 0$    (4.31)

where $\hat{X}_{h(new)}$ denotes the point estimator of the new level $X_{h(new)}$. Figure 4.2 contains
a representation of this point estimator for an example to be discussed shortly. It can be
shown that the estimator $\hat{X}_{h(new)}$ is the maximum likelihood estimator of $X_{h(new)}$ for normal
error regression model (2.1).

Approximate $1 - \alpha$ confidence limits for $X_{h(new)}$ are:

    $\hat{X}_{h(new)} \pm t(1 - \alpha/2;\; n - 2)\, s\{predX\}$    (4.32)

where:

    $s^2\{predX\} = \dfrac{MSE}{b_1^2}\left[1 + \dfrac{1}{n} + \dfrac{(\hat{X}_{h(new)} - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$    (4.32a)

FIGURE 4.2 Scatter Plot and Fitted Regression Line, Calibration Example. (Horizontal axis: Actual Galactose Concentration.)

Example


A medical researcher studied a new, quick method for measuring low concentration of 
galactose (sugar) in the blood. Twelve samples were used in the study containing known 
concentrations (X), with three samples at each of four different levels. The measured 
concentration (Y) was then observed for each sample. Linear regression model (2.1) was
fitted with the following results:

    $n = 12 \qquad b_0 = -.100 \qquad b_1 = 1.017 \qquad MSE = .0272$

    $s\{b_1\} = .0142 \qquad \bar{X} = 5.500 \qquad \bar{Y} = 5.492 \qquad \sum (X_i - \bar{X})^2 = 135$

    $\hat{Y} = -.100 + 1.017X$

The data and the estimated regression line are plotted in Figure 4.2. 

The researcher first wished to make sure that there is a linear association between the
two variables. A test of $H_0$: $\beta_1 = 0$ versus $H_a$: $\beta_1 \ne 0$, utilizing test statistic $t^* = b_1/s\{b_1\} =
1.017/.0142 = 71.6$, was conducted for $\alpha = .05$. Since $t(.975; 10) = 2.228$ and $|t^*| =
71.6 > 2.228$, it was concluded that $\beta_1 \ne 0$, or that a linear association exists between the
measured concentration and the actual concentration.

The researcher now wishes to use the regression relation to ascertain the actual concentration
$X_{h(new)}$ for a new patient for whom the quick procedure yielded a measured
concentration of $Y_{h(new)} = 6.52$. It is desired to estimate $X_{h(new)}$ by means of a 95 percent
confidence interval. Using (4.31) and (4.32a), we obtain:

    $\hat{X}_{h(new)} = \dfrac{6.52 - (-.100)}{1.017} = 6.509$

    $s^2\{predX\} = \dfrac{.0272}{(1.017)^2}\left[1 + \dfrac{1}{12} + \dfrac{(6.509 - 5.500)^2}{135}\right] = .0287$

so that $s\{predX\} = .1694$. We require $t(.975; 10) = 2.228$, and using (4.32) we obtain the
confidence limits $6.509 \pm 2.228(.1694)$. Hence, the 95 percent confidence interval is:

    $6.13 \le X_{h(new)} \le 6.89$



Thus, it can be concluded with 95 percent confidence that the actual galactose concentration 
for the patient is between 6.13 and 6.89. This is approximately a ±6 percent error, which
is considered reasonable by the researcher. 
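
The calibration calculation can be retraced in a few lines of Python. The sketch below uses only the summary results quoted in the example; the variable names are illustrative, and the t value is copied from the text rather than computed.

```python
import math

# Summary results quoted in the calibration example.
n = 12
b0, b1 = -0.100, 1.017
mse = 0.0272
x_bar = 5.500
sxx = 135.0          # sum of (X_i - X_bar)^2
t_crit = 2.228       # t(.975; 10), taken from the text
y_new = 6.52         # measured concentration for the new patient

x_hat = (y_new - b0) / b1                                               # point estimator (4.31)
s2_pred_x = (mse / b1 ** 2) * (1 + 1 / n + (x_hat - x_bar) ** 2 / sxx)  # (4.32a)
s_pred_x = math.sqrt(s2_pred_x)
lower = x_hat - t_crit * s_pred_x                                       # limits from (4.32)
upper = x_hat + t_crit * s_pred_x

print(f"X_hat = {x_hat:.3f}, s2(predX) = {s2_pred_x:.4f}, s(predX) = {s_pred_x:.4f}")
print(f"95 percent interval: {lower:.2f} to {upper:.2f}")
```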


Comments 

1. The inverse prediction problem is also known as a calibration problem since it is applicable
when inexpensive, quick, and approximate measurements (Y) are related to precise, often expensive,
and time-consuming measurements (X) based on n observations. The resulting regression model is
then used to estimate the precise measurement $X_{h(new)}$ for a new approximate measurement $Y_{h(new)}$.
We illustrated this use in the calibration example.

2. The approximate confidence interval (4.32) is appropriate if the quantity:

    $\dfrac{[t(1 - \alpha/2;\; n - 2)]^2\, MSE}{b_1^2 \sum (X_i - \bar{X})^2}$    (4.33)

is small, say less than .1. For the calibration example, this quantity is:

    $\dfrac{(2.228)^2 (.0272)}{(1.017)^2 (135)} = .00097$

so that the approximate confidence interval is appropriate here.
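
Criterion (4.33) is easy to evaluate directly. The following sketch plugs in the calibration results quoted above; the variable names are illustrative.

```python
# Calibration example values quoted in the text.
t_crit = 2.228   # t(.975; 10)
mse = 0.0272
b1 = 1.017
sxx = 135.0      # sum of (X_i - X_bar)^2

criterion = t_crit ** 2 * mse / (b1 ** 2 * sxx)   # quantity (4.33)
is_small = criterion < 0.1                        # rule of thumb stated in the text

print(f"criterion = {criterion:.5f}, approximate interval appropriate: {is_small}")
```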

3. Simultaneous prediction intervals based on g different new observed measurements $Y_{h(new)}$,
with a $1 - \alpha$ family confidence coefficient, are easily obtained by using either the Bonferroni or the
Scheffé procedures discussed in Section 4.3. The value of $t(1 - \alpha/2;\; n - 2)$ in (4.32) is replaced by
either $B = t(1 - \alpha/2g;\; n - 2)$ or $S = [gF(1 - \alpha;\; g, n - 2)]^{1/2}$.

4. The inverse prediction problem has aroused controversy among statisticians. Some statisticians 

have suggested that inverse predictions should be made in direct fashion by regressing X on Y. This 
regression is called inverse regression. ■ 


4.7 Choice of X Levels 


When regression data are obtained by experiment, the levels of X at which observations
on Y are to be taken are under the control of the experimenter. Among other things, the
experimenter will have to consider:

1. How many levels of X should be investigated?

2. What shall the two extreme levels be?

3. How shall the other levels of X, if any, be spaced?

4. How many observations should be taken at each level of X?


There is no single answer to these questions, since different purposes of the regression 
analysis lead to different answers. The possible objectives in regression analysis are varied, 
as we have noted earlier. The main objective may be to estimate the slope of the regression 
line or, in some cases, to estimate the intercept. In many cases, the main objective is to 
predict one or more new observations or to estimate one or more mean responses. When 
the regression function is curvilinear, the main objective may be to locate the maximum or 
minimum mean response. At still other times, the main purpose is to determine the nature 
of the regression function. 


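To illustrate how the purpose affects the design, consider the variances of $b_0$, $b_1$, $\hat{Y}_h$, and
for predicting $Y_{h(new)}$, which were developed earlier for regression model (2.1):

    $\sigma^2\{b_0\} = \sigma^2\left[\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right]$    (4.34)

    $\sigma^2\{b_1\} = \dfrac{\sigma^2}{\sum (X_i - \bar{X})^2}$    (4.35)

    $\sigma^2\{\hat{Y}_h\} = \sigma^2\left[\dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$    (4.36)

    $\sigma^2\{pred\} = \sigma^2\left[1 + \dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$    (4.37)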


If the main purpose of the regression analysis is to estimate the slope $\beta_1$, the variance of
$b_1$ is minimized if $\sum (X_i - \bar{X})^2$ is maximized. This is accomplished by using two levels of X,
at the two extremes for the scope of the model, and placing half of the observations at each
of the two levels. Of course, if one were not sure of the linearity of the regression function,
one would be hesitant to use only two levels since they would provide no information about
possible departures from linearity. If the main purpose is to estimate the intercept $\beta_0$, the
number and placement of levels does not affect the variance of $b_0$ as long as $\bar{X} = 0$. On the
other hand, to estimate the mean response or to predict a new observation at the level $X_h$,
the relevant variance is minimized by using X levels so that $\bar{X} = X_h$.
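
The effect of the design on the precision of the slope can be seen by comparing two hypothetical designs of 12 observations over the same range. The designs and the helper `sum_sq_dev` below are illustrative assumptions, not taken from the text.

```python
def sum_sq_dev(levels):
    """Sum of squared deviations of the X levels about their mean."""
    mean = sum(levels) / len(levels)
    return sum((x - mean) ** 2 for x in levels)

# Two hypothetical designs with 12 observations on the range 0 to 10.
two_level = [0.0] * 6 + [10.0] * 6                     # half at each extreme
equally_spaced = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0] * 2   # two observations at each of six levels

sxx_two = sum_sq_dev(two_level)
sxx_spaced = sum_sq_dev(equally_spaced)

# By (4.35), the variance of b1 is sigma^2 divided by this sum, so the ratio of
# variances (spaced design relative to two-level design) is sxx_two / sxx_spaced.
variance_ratio = sxx_two / sxx_spaced

print(f"two-level: {sxx_two:.0f}, equally spaced: {sxx_spaced:.0f}, "
      f"variance ratio: {variance_ratio:.2f}")
```

In this illustration the equally spaced design roughly doubles the variance of the slope estimate, which is the price paid for being able to check linearity.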

Although the number and spacing of X levels depends very much on the major purpose 
of the regression analysis, the general advice given by D. R. Cox is still relevant: 

Use two levels when the object is primarily to examine whether or not... (the predictor 
variable)... has an effect and in which direction that effect is. Use three levels whenever a 
description of the response curve by its slope and curvature is likely to be adequate; this 
should cover most cases. Use four levels if further examination of the shape of the response 
curve is important. Use more than four levels when it is required to estimate the detailed
shape of the response curve, or when the curve is expected to rise to an asymptotic value, or
in general to show features not adequately described by slope and curvature. Except in these 
last cases it is generally satisfactory to use equally spaced levels with equal numbers of 
observations per level (Ref. 4.4). 




Cited References

4.1. Miller, R. G., Jr. Simultaneous Statistical Inference. 2nd ed. New York: Springer-Verlag, 1991.

4.2. Fuller, W. A. Measurement Error Models. New York: John Wiley & Sons, 1987.

4.3. Berkson, J. "Are There Two Regressions?" Journal of the American Statistical Association 45
(1950), pp. 164-80.

4.4. Cox, D. R. Planning of Experiments. New York: John Wiley & Sons, 1958, pp. 141-42.


Problems

4.1. When joint confidence intervals for $\beta_0$ and $\beta_1$ are developed by the Bonferroni method with
a family confidence coefficient of 90 percent, does this imply that 10 percent of the time the
confidence interval for $\beta_0$ will be incorrect? That 5 percent of the time the confidence interval
for $\beta_0$ will be incorrect and 5 percent of the time that for $\beta_1$ will be incorrect? Discuss.

4.2. Refer to Problem 2.1. Suppose the student combines the two confidence intervals into a confidence
set. What can you say about the family confidence coefficient for this set?

*4.3. Refer to Copier maintenance Problem 1.20.

a. Will $b_0$ and $b_1$ tend to err in the same direction or in opposite directions here? Explain.

b. Obtain Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$, using a 95 percent family confidence
coefficient.

c. A consultant has suggested that $\beta_0$ should be 0 and $\beta_1$ should equal 14.0. Do your joint
confidence intervals in part (b) support this view?

*4.4. Refer to Airfreight breakage Problem 1.21.

a. Will $b_0$ and $b_1$ tend to err in the same direction or in opposite directions here? Explain.

b. Obtain Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$, using a 99 percent family confidence
coefficient. Interpret your confidence intervals.

4.5. Refer to Plastic hardness Problem 1.22.

a. Obtain Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$, using a 90 percent family confidence
coefficient. Interpret your confidence intervals.

b. Are $b_0$ and $b_1$ positively or negatively correlated here? Is this reflected in your joint confidence
intervals in part (a)?

c. What is the meaning of the family confidence coefficient in part (a)?

*4.6. Refer to Muscle mass Problem 1.27.

a. Obtain Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$, using a 99 percent family confidence
coefficient. Interpret your confidence intervals.

b. Will $b_0$ and $b_1$ tend to err in the same direction or in opposite directions here? Explain.

c. A researcher has suggested that $\beta_0$ should equal approximately 160 and that $\beta_1$ should be
between -1.9 and -1.5. Do the joint confidence intervals in part (a) support this expectation?

*4.7. Refer to Copier maintenance Problem 1.20.

a. Estimate the expected number of minutes spent when there are 3, 5, and 7 copiers to be
serviced, respectively. Use interval estimates with a 90 percent family confidence coefficient
based on the Working-Hotelling procedure.

b. Two service calls for preventive maintenance are scheduled in which the numbers of copiers
to be serviced are 4 and 7, respectively. A family of prediction intervals for the times to
be spent on these calls is desired with a 90 percent family confidence coefficient. Which
procedure, Scheffé or Bonferroni, will provide tighter prediction limits here?

c. Obtain the family of prediction intervals required in part (b), using the more efficient
procedure.






*4.8. Refer to Airfreight breakage Problem 1.21.

a. It is desired to obtain interval estimates of the mean number of broken ampules when there
are 0, 1, and 2 transfers for a shipment, using a 95 percent family confidence coefficient.
Obtain the desired confidence intervals, using the Working-Hotelling procedure.

b. Are the confidence intervals obtained in part (a) more efficient than Bonferroni intervals
here? Explain.

c. The next three shipments will make 0, 1, and 2 transfers, respectively. Obtain prediction
intervals for the number of broken ampules for each of these three shipments, using the
Scheffé procedure and a 95 percent family confidence coefficient.

d. Would the Bonferroni procedure have been more efficient in developing the prediction
intervals in part (c)? Explain.

4.9. Refer to Plastic hardness Problem 1.22.

a. Management wishes to obtain interval estimates of the mean hardness when the elapsed time
is 20, 30, and 40 hours, respectively. Calculate the desired confidence intervals, using the
Bonferroni procedure and a 90 percent family confidence coefficient. What is the meaning
of the family confidence coefficient here?

b. Is the Bonferroni procedure employed in part (a) the most efficient one that could be
employed here? Explain.

c. The next two test items will be measured after 30 and 40 hours of elapsed time, respectively.
Predict the hardness for each of these two items, using the most efficient procedure and a
90 percent family confidence coefficient.

*4.10. Refer to Muscle mass Problem 1.27.

a. The nutritionist is particularly interested in the mean muscle mass for women aged 45, 55, and
65. Obtain joint confidence intervals for the means of interest using the Working-Hotelling
procedure and a 95 percent family confidence coefficient.

b. Is the Working-Hotelling procedure the most efficient one to be employed in part (a)?
Explain.

c. Three additional women aged 48, 59, and 74 have contacted the nutritionist. Predict the
muscle mass for each of these three women using the Bonferroni procedure and a 95 percent
family confidence coefficient.

d. Subsequently, the nutritionist wishes to predict the muscle mass for a fourth woman aged
64, with a family confidence coefficient of 95 percent for the four predictions. Will the three
prediction intervals in part (c) have to be recalculated? Would this also be true if the Scheffé
procedure had been used in constructing the prediction intervals?

4.11. A behavioral scientist said, "I am never sure whether the regression line goes through the origin.
Hence, I will not use such a model." Comment.

4.12. Typographical errors. Shown below are the number of galleys for a manuscript (X) and
the total dollar cost of correcting typographical errors (Y) in a random sample of recent orders
handled by a firm specializing in technical manuscripts. Since Y involves variable costs only, an
analyst wished to determine whether regression-through-the-origin model (4.10) is appropriate
for studying the relation between the two variables.

  i:      1    2    3    4    5    6    7    8    9   10   11   12

  $X_i$:  7   12   10   10   14   25   30   25   18   10    4    6

  $Y_i$: 128  213  191  178  250  446  540  457  324  177   75  107

a. Fit regression model (4.10) and state the estimated regression function.






b. Plot the estimated regression function and the data. Does a linear regression function through
the origin appear to provide a good fit here? Comment.

c. In estimating costs of handling prospective orders, management has used a standard of
$17.50 per galley for the cost of correcting typographical errors. Test whether or not this
standard should be revised; use $\alpha = .02$. State the alternatives, decision rule, and conclusion.

d. Obtain a prediction interval for the correction cost on a forthcoming job involving 10 galleys.
Use a confidence coefficient of 98 percent.

4.13. Refer to Typographical errors Problem 4.12.

a. Obtain the residuals $e_i$. Do they sum to zero? Plot the residuals against the fitted values $\hat{Y}_i$.
What conclusions can be drawn from your plot?

b. Conduct a formal test for lack of fit of linear regression through the origin; use $\alpha = .01$.
State the alternatives, decision rule, and conclusion. What is the P-value of the test?

4.14. Refer to Grade point average Problem 1.19. Assume that linear regression through the origin
model (4.10) is appropriate.

a. Fit regression model (4.10) and state the estimated regression function.

b. Estimate $\beta_1$ with a 95 percent confidence interval. Interpret your interval estimate.

c. Estimate the mean freshman GPA for students whose ACT test score is 30. Use a 95 percent
confidence interval.

4.15. Refer to Grade point average Problem 4.14.

a. Plot the fitted regression line and the data. Does the linear regression function through the
origin appear to be a good fit here?

b. Obtain the residuals $e_i$. Do they sum to zero? Plot the residuals against the fitted values $\hat{Y}_i$.
What conclusions can be drawn from your plot?

c. Conduct a formal test for lack of fit of linear regression through the origin; use $\alpha = .005$.
State the alternatives, decision rule, and conclusion. What is the P-value of the test?

*4.16. Refer to Copier maintenance Problem 1.20. Assume that linear regression through the origin
model (4.10) is appropriate.

a. Obtain the estimated regression function.

b. Estimate $\beta_1$ with a 90 percent confidence interval. Interpret your interval estimate.

c. Predict the service time on a new call in which six copiers are to be serviced. Use a 90 percent
prediction interval.

*4.17. Refer to Copier maintenance Problem 4.16.

a. Plot the fitted regression line and the data. Does the linear regression function through the
origin appear to be a good fit here?

b. Obtain the residuals $e_i$. Do they sum to zero? Plot the residuals against the fitted values $\hat{Y}_i$.
What conclusions can be drawn from your plot?

c. Conduct a formal test for lack of fit of linear regression through the origin; use $\alpha = .01$.
State the alternatives, decision rule, and conclusion. What is the P-value of the test?

4.18. Refer to Plastic hardness Problem 1.22. Suppose that errors arise in X because the laboratory
technician is instructed to measure the hardness of the ith specimen ($Y_i$) at a prerecorded
elapsed time ($X_i$), but the timing is imperfect so the true elapsed time varies at random from
the prerecorded elapsed time. Will ordinary least squares estimates be biased here? Discuss.

4.19. Refer to Grade point average Problem 1.19. A new student earned a grade point average of
3.4 in the freshman year.





a. Obtain a 90 percent confidence interval for the student’s ACT test score. Interpret your 
confidence interval. 

b. Is criterion (4.33) as to the appropriateness of the approximate confidence interval met here? 

4.20. Refer to Plastic hardness Problem 1.22. The measurement of a new test item showed 238 Brinell 
units of hardness. 

a. Obtain a 99 percent confidence interval for the elapsed time before the hardness was
measured. Interpret your confidence interval.

b. Is criterion (4.33) as to the appropriateness of the approximate confidence interval met here? 

Exercises

4.21. When the predictor variable is so coded that $\bar{X} = 0$ and the normal error regression model (2.1)
applies, are $b_0$ and $b_1$ independent? Are the joint confidence intervals for $\beta_0$ and $\beta_1$ then
independent?

4.22. Derive an extension of the Bonferroni inequality (4.2a) for the case of three statements, each
with statement confidence coefficient $1 - \alpha$.

4.23. Show that for the fitted least squares regression line through the origin (4.15), $\sum X_i e_i = 0$.

4.24. Show that $\hat{Y}$ as defined in (4.15) for linear regression through the origin is an unbiased estimator
of $E\{Y\}$.

4.25. Derive the formula for $s^2\{\hat{Y}_h\}$ given in Table 4.1 for linear regression through the origin.

Projects

4.26. Refer to the CDI data set in Appendix C.2 and Project 1.43. Consider the regression relation
of number of active physicians to total population.

a. Obtain Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$, using a 95 percent family
confidence coefficient.

b. An investigator has suggested that $\beta_0$ should be $-100$ and $\beta_1$ should be .0028. Do the joint
confidence intervals in part (a) support this view? Discuss.

c. It is desired to estimate the expected number of active physicians for counties with total 
population of X = 500, 1,000, 5,000 thousands with family confidence coefficient .90. 
Which procedure, the Working-Hotelling or the Bonferroni, is more efficient here?

d. Obtain the family of interval estimates required in part (c), using the more efficient procedure. 
Interpret your confidence intervals. 

4.27. Refer to the SENIC data set in Appendix C.1 and Project 1.45. Consider the regression relation
of average length of stay to infection risk.

a. Obtain Bonferroni joint confidence intervals for $\beta_0$ and $\beta_1$, using a 90 percent family
confidence coefficient.

b. A researcher suggested that $\beta_0$ should be approximately 7 and $\beta_1$ should be approximately 1.
Do the joint intervals in part (a) support this expectation? Discuss.

c. It is desired to estimate the expected hospital stay for persons with infection risks X =
2, 3, 4, 5 with family confidence coefficient .95. Which procedure, the Working-Hotelling
or the Bonferroni, is more efficient here?

d. Obtain the family of interval estimates required in part (c), using the more efficient procedure.
Interpret your confidence intervals.



Chapter 5

Matrix Approach to Simple
Linear Regression Analysis

Matrix algebra is widely used for mathematical and statistical analysis. The matrix approach
is practically a necessity in multiple regression analysis, since it permits extensive systems
of equations and large arrays of data to be denoted compactly and operated upon efficiently.

In this chapter, we first take up a brief introduction to matrix algebra. (A more comprehensive
treatment of matrix algebra may be found in specialized texts such as Reference 5.1.)
Then we apply matrix methods to the simple linear regression model discussed in previous
chapters. Although matrix algebra is not really required for simple linear regression,
the application of matrix methods to this case will provide a useful transition to multiple
regression, which will be taken up in Parts II and III.

Readers familiar with matrix algebra may wish to scan the introductory parts of this
chapter and focus upon the later parts dealing with the use of matrix methods in regression
analysis.

5.1 Matrices

Definition of Matrix

A matrix is a rectangular array of elements arranged in rows and columns. An example of
a matrix is:

$$
\begin{array}{lcc}
 & \text{Column 1} & \text{Column 2} \\
\text{Row 1} & 16{,}000 & 23 \\
\text{Row 2} & 33{,}000 & 47 \\
\text{Row 3} & 21{,}000 & 35
\end{array}
$$

The elements of this particular matrix are numbers representing income (column 1) and
age (column 2) of three persons. The elements are arranged by row (person) and column
(characteristic of person). Thus, the element in the first row and first column (16,000)
represents the income of the first person. The element in the first row and second column (23)
represents the age of the first person. The dimension of the matrix is 3 x 2, i.e., 3 rows by
2 columns. If we wanted to present income and age for 1,000 persons in a matrix with the
same format as the one earlier, we would require a 1,000 x 2 matrix.

Other examples of matrices are: 


$$
\begin{bmatrix} 1 & 0 \\ 5 & 10 \end{bmatrix}
\qquad
\begin{bmatrix} 4 & 7 & 12 & 16 \\ 3 & 15 & 9 & 8 \end{bmatrix}
$$

These two matrices have dimensions of 2 x 2 and 2 x 4, respectively. Note that in giving the
dimension of a matrix, we always specify the number of rows first and then the number of
columns. As in ordinary algebra, we may use symbols to identify the elements of a matrix:




$$
\begin{array}{c|ccc}
 & j = 1 & j = 2 & j = 3 \\ \hline
i = 1 & a_{11} & a_{12} & a_{13} \\
i = 2 & a_{21} & a_{22} & a_{23}
\end{array}
$$

Note that the first subscript identifies the row number and the second the column number.
We shall use the general notation $a_{ij}$ for the element in the $i$th row and the $j$th column. In
our above example, $i = 1, 2$ and $j = 1, 2, 3$.

A matrix may be denoted by a symbol such as A, X, or Z. The symbol is in boldface to
identify that it refers to a matrix. Thus, we might define for the above matrix:

$$
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}
$$

Reference to the matrix A then implies reference to the 2 x 3 array just given.
Another notation for the matrix A just given is:

$$
\mathbf{A} = [a_{ij}] \qquad i = 1, 2; \; j = 1, 2, 3
$$


This notation avoids the need for writing out all elements of the matrix by stating only the 
general element. It can only be used, of course, when the elements of a matrix are symbols. 
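To summarize, a matrix with r rows and c columns will be represented either in full:

$$
\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1c} \\
a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2c} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{ic} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{r1} & a_{r2} & \cdots & a_{rj} & \cdots & a_{rc}
\end{bmatrix}
\tag{5.1}
$$

or in abbreviated form:

$$
\mathbf{A} = [a_{ij}] \qquad i = 1, \ldots, r; \; j = 1, \ldots, c
$$

or simply by a boldface symbol, such as A.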

Comments

1. Do not think of a matrix as a number. It is a set of elements arranged in an array. Only when
the matrix has dimension 1 x 1 is there a single number in a matrix, in which case one can think of
it interchangeably as either a matrix or a number.

2. The following is not a matrix:

        14
    8
        10   15
    9   16

since the numbers are not arranged in columns and rows.


Square Matrix 


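A matrix is said to be square if the number of rows equals the number of columns. Two
examples are:

$$
\begin{bmatrix} 4 & 7 \\ 3 & 9 \end{bmatrix}
\qquad
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
$$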


Vector 


A matrix containing only one column is called a column vector or simply a vector. Two
examples are:

$$
\mathbf{A} = \begin{bmatrix} 4 \\ 7 \\ 10 \end{bmatrix}
\qquad
\mathbf{C} = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \end{bmatrix}
$$

The vector A is a 3 x 1 matrix, and the vector C is a 5 x 1 matrix.

A matrix containing only one row is called a row vector. Two examples are:

$$
\mathbf{B}' = \begin{bmatrix} 15 & 25 & 50 \end{bmatrix}
\qquad
\mathbf{F}' = \begin{bmatrix} f_1 & f_2 \end{bmatrix}
$$

We use the prime symbol for row vectors for reasons to be seen shortly. Note that the row
vector B' is a 1 x 3 matrix and the row vector F' is a 1 x 2 matrix.

A single subscript suffices to identify the elements of a vector.

Transpose


The transpose of a matrix A is another matrix, denoted by A', that is obtained by
interchanging corresponding columns and rows of the matrix A.

For example, if:

$$
\underset{3 \times 2}{\mathbf{A}} = \begin{bmatrix} 2 & 5 \\ 7 & 10 \\ 3 & 4 \end{bmatrix}
$$

then the transpose A' is:

$$
\underset{2 \times 3}{\mathbf{A}'} = \begin{bmatrix} 2 & 7 & 3 \\ 5 & 10 & 4 \end{bmatrix}
$$

Note that the first column of A is the first row of A', and similarly the second column of A
is the second row of A'. Correspondingly, the first row of A has become the first column
of A', and so on. Note that the dimension of A, indicated under the symbol A, becomes
reversed for the dimension of A'.

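As another example, consider:

$$
\underset{3 \times 1}{\mathbf{C}} = \begin{bmatrix} 4 \\ 7 \\ 10 \end{bmatrix}
\qquad
\underset{1 \times 3}{\mathbf{C}'} = \begin{bmatrix} 4 & 7 & 10 \end{bmatrix}
$$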

Thus, the transpose of a column vector is a row vector, and vice versa. This is the reason 
why we used the symbol B' earlier to identify a row vector, since it may be thought of as 
the transpose of a column vector B. 
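
As a minimal sketch of the transpose operation, assuming a matrix is stored as a Python list of rows (the function name `transpose` is illustrative and does not come from the text), the two examples above can be reproduced as follows:

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    # zip(*matrix) collects the first entries of every row, then the
    # second entries, and so on, i.e. the columns of the matrix.
    return [list(column) for column in zip(*matrix)]


A = [[2, 5],
     [7, 10],
     [3, 4]]            # the 3 x 2 matrix A from the text
C = [[4], [7], [10]]    # the 3 x 1 column vector C from the text

A_t = transpose(A)      # 2 x 3
C_t = transpose(C)      # 1 x 3 row vector

print(A_t)
print(C_t)
```

Transposing twice returns the original matrix, which is a quick way to check an implementation.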

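In general, we have:

$$
\underset{r \times c}{\mathbf{A}} =
\begin{bmatrix} a_{11} & \cdots & a_{1c} \\ \vdots & & \vdots \\ a_{r1} & \cdots & a_{rc} \end{bmatrix}
= [a_{ij}] \qquad i = 1, \ldots, r; \; j = 1, \ldots, c
\tag{5.2}
$$

where $i$ is the row index and $j$ is the column index, and:

$$
\underset{c \times r}{\mathbf{A}'} =
\begin{bmatrix} a_{11} & \cdots & a_{r1} \\ \vdots & & \vdots \\ a_{1c} & \cdots & a_{rc} \end{bmatrix}
= [a_{ji}] \qquad j = 1, \ldots, c; \; i = 1, \ldots, r
\tag{5.3}
$$

where $j$ is now the row index and $i$ is the column index.

Thus, the element in the $i$th row and the $j$th column in A is found in the $j$th row and $i$th
column in A'.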


Equality of Matrices 

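Two matrices A and B are said to be equal if they have the same dimension and if all
corresponding elements are equal. Conversely, if two matrices are equal, their corresponding
elements are equal. For example, if:

$$
\underset{3 \times 1}{\mathbf{A}} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}
\qquad
\underset{3 \times 1}{\mathbf{B}} = \begin{bmatrix} 4 \\ 7 \\ 3 \end{bmatrix}
$$

then A = B implies:

$$
a_1 = 4 \qquad a_2 = 7 \qquad a_3 = 3
$$

Similarly, if:

$$
\underset{3 \times 2}{\mathbf{A}} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix}
\qquad
\underset{3 \times 2}{\mathbf{B}} = \begin{bmatrix} 17 & 2 \\ 14 & 5 \\ 13 & 9 \end{bmatrix}
$$

then A = B implies:

$$
\begin{aligned}
a_{11} &= 17 & a_{12} &= 2 \\
a_{21} &= 14 & a_{22} &= 5 \\
a_{31} &= 13 & a_{32} &= 9
\end{aligned}
$$

Regression Examples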

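In regression analysis, one basic matrix is the vector Y, consisting of the n observations on
the response variable:

$$
\underset{n \times 1}{\mathbf{Y}} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
\tag{5.4}
$$

Note that the transpose Y' is the row vector:

$$
\underset{1 \times n}{\mathbf{Y}'} = \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_n \end{bmatrix}
\tag{5.5}
$$

Another basic matrix in regression analysis is the X matrix, which is defined as follows for
simple linear regression analysis:

$$
\underset{n \times 2}{\mathbf{X}} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
\tag{5.6}
$$

The matrix X consists of a column of 1s and a column containing the n observations on the
predictor variable X. Note that the transpose of X is:

$$
\underset{2 \times n}{\mathbf{X}'} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}
\tag{5.7}
$$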


The X matrix is often referred to as the design matrix. 
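
A short sketch of how the design matrix in (5.6) and its transpose in (5.7) might be built in code. The helper name `design_matrix` and the three predictor values are hypothetical, chosen only for illustration:

```python
def design_matrix(x_values):
    """Build the n x 2 matrix X of (5.6): a column of 1s and a column of X_i."""
    return [[1, x] for x in x_values]


def transpose(matrix):
    """Interchange rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]


x_values = [30, 20, 60]          # hypothetical predictor observations, n = 3
X = design_matrix(x_values)      # 3 x 2
X_t = transpose(X)               # 2 x 3, as in (5.7)

print(X)
print(X_t)
```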


5.2 Matrix Addition and Subtraction 


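Adding or subtracting two matrices requires that they have the same dimension. The sum,
or difference, of two matrices is another matrix whose elements each consist of the sum, or
difference, of the corresponding elements of the two matrices. Suppose:

$$
\underset{3 \times 2}{\mathbf{A}} = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}
\qquad
\underset{3 \times 2}{\mathbf{B}} = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 4 \end{bmatrix}
$$

then:

$$
\underset{3 \times 2}{\mathbf{A} + \mathbf{B}} =
\begin{bmatrix} 1 + 1 & 4 + 2 \\ 2 + 2 & 5 + 3 \\ 3 + 3 & 6 + 4 \end{bmatrix}
= \begin{bmatrix} 2 & 6 \\ 4 & 8 \\ 6 & 10 \end{bmatrix}
$$

Similarly:

$$
\underset{3 \times 2}{\mathbf{A} - \mathbf{B}} =
\begin{bmatrix} 1 - 1 & 4 - 2 \\ 2 - 2 & 5 - 3 \\ 3 - 3 & 6 - 4 \end{bmatrix}
= \begin{bmatrix} 0 & 2 \\ 0 & 2 \\ 0 & 2 \end{bmatrix}
$$

In general, if:

$$
\underset{r \times c}{\mathbf{A}} = [a_{ij}] \qquad \underset{r \times c}{\mathbf{B}} = [b_{ij}] \qquad i = 1, \ldots, r; \; j = 1, \ldots, c
$$

then:

$$
\underset{r \times c}{\mathbf{A} + \mathbf{B}} = [a_{ij} + b_{ij}]
\qquad \text{and} \qquad
\underset{r \times c}{\mathbf{A} - \mathbf{B}} = [a_{ij} - b_{ij}]
\tag{5.8}
$$

Formula (5.8) generalizes in an obvious way to addition and subtraction of more than two
matrices. Note also that A + B = B + A, as in ordinary algebra.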
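
Formula (5.8) translates directly into element-by-element code. The following is a minimal sketch, assuming matrices are stored as lists of rows; the function names `add` and `subtract` are illustrative only:

```python
def _check_same_dimension(A, B):
    """Raise ValueError unless A and B have the same number of rows and columns."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same dimension")


def add(A, B):
    """Return A + B = [a_ij + b_ij], as in (5.8)."""
    _check_same_dimension(A, B)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def subtract(A, B):
    """Return A - B = [a_ij - b_ij], as in (5.8)."""
    _check_same_dimension(A, B)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


A = [[1, 4], [2, 5], [3, 6]]
B = [[1, 2], [2, 3], [3, 4]]

print(add(A, B))
print(subtract(A, B))
```

The dimension check mirrors the requirement stated at the start of this section: the sum and difference are only defined for matrices of the same dimension.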

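Regression Example

The regression model:

$$
Y_i = E\{Y_i\} + \varepsilon_i \qquad i = 1, \ldots, n
$$

can be written compactly in matrix notation. First, let us define the vector of the mean
responses:

$$
\underset{n \times 1}{E\{\mathbf{Y}\}} = \begin{bmatrix} E\{Y_1\} \\ E\{Y_2\} \\ \vdots \\ E\{Y_n\} \end{bmatrix}
\tag{5.9}
$$

and the vector of the error terms:

$$
\underset{n \times 1}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\tag{5.10}
$$

Recalling the definition of the observations vector Y in (5.4), we can write the regression
model as follows:

$$
\underset{n \times 1}{\mathbf{Y}} = \underset{n \times 1}{E\{\mathbf{Y}\}} + \underset{n \times 1}{\boldsymbol{\varepsilon}}
$$

because:

$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} E\{Y_1\} \\ E\{Y_2\} \\ \vdots \\ E\{Y_n\} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
=
\begin{bmatrix} E\{Y_1\} + \varepsilon_1 \\ E\{Y_2\} + \varepsilon_2 \\ \vdots \\ E\{Y_n\} + \varepsilon_n \end{bmatrix}
$$

Thus, the observations vector Y equals the sum of two vectors, a vector containing the
expected values and another containing the error terms.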



5.3 Matrix Multiplication

Multiplication of a Matrix by a Scalar

A scalar is an ordinary number or a symbol representing a number. In multiplication of a
matrix by a scalar, every element of the matrix is multiplied by the scalar. For example,
suppose the matrix A is given by:

$$
\mathbf{A} = \begin{bmatrix} 2 & 7 \\ 9 & 3 \end{bmatrix}
$$

Then 4A, where 4 is the scalar, equals:

$$
4\mathbf{A} = 4 \begin{bmatrix} 2 & 7 \\ 9 & 3 \end{bmatrix} = \begin{bmatrix} 8 & 28 \\ 36 & 12 \end{bmatrix}
$$

Similarly, kA equals:

$$
k\mathbf{A} = k \begin{bmatrix} 2 & 7 \\ 9 & 3 \end{bmatrix} = \begin{bmatrix} 2k & 7k \\ 9k & 3k \end{bmatrix}
$$

where k denotes a scalar.

If every element of a matrix has a common factor, this factor can be taken outside the
matrix and treated as a scalar. For example:

$$
\begin{bmatrix} 9 & 27 \\ 15 & 18 \end{bmatrix} = 3 \begin{bmatrix} 3 & 9 \\ 5 & 6 \end{bmatrix}
$$

Similarly:

$$
\begin{bmatrix} 2/k & 8/k \\ 5/k & 3/k \end{bmatrix} = \frac{1}{k} \begin{bmatrix} 2 & 8 \\ 5 & 3 \end{bmatrix}
$$

In general, if $\mathbf{A} = [a_{ij}]$ and k is a scalar, we have:

$$
k\mathbf{A} = \mathbf{A}k = [k a_{ij}]
\tag{5.11}
$$

Multiplication of a Matrix by a Matrix

Multiplication of a matrix by a matrix may appear somewhat complicated at first, but a little
practice will make it a routine operation.

Consider the two matrices:

$$
\underset{2 \times 2}{\mathbf{A}} = \begin{bmatrix} 2 & 5 \\ 4 & 1 \end{bmatrix}
\qquad
\underset{2 \times 2}{\mathbf{B}} = \begin{bmatrix} 4 & 6 \\ 5 & 8 \end{bmatrix}
$$

The product AB will be a 2 x 2 matrix whose elements are obtained by finding the cross
products of rows of A with columns of B and summing the cross products. For instance, to
find the element in the first row and the first column of the product AB, we work with the
first row of A and the first column of B, as follows:

$$
\text{Row 1 of } \mathbf{A}: \begin{bmatrix} 2 & 5 \end{bmatrix}
\qquad
\text{Col. 1 of } \mathbf{B}: \begin{bmatrix} 4 \\ 5 \end{bmatrix}
\qquad
\text{Row 1, Col. 1 of } \mathbf{AB}: \; 33
$$

We take the cross products and sum:

$$
2(4) + 5(5) = 33
$$

The number 33 is the element in the first row and first column of the matrix AB.

To find the element in the first row and second column of AB, we work with the first row
of A and the second column of B:

$$
\text{Row 1 of } \mathbf{A}: \begin{bmatrix} 2 & 5 \end{bmatrix}
\qquad
\text{Col. 2 of } \mathbf{B}: \begin{bmatrix} 6 \\ 8 \end{bmatrix}
\qquad
\text{Row 1, Col. 2 of } \mathbf{AB}: \; 52
$$

The sum of the cross products is:

$$
2(6) + 5(8) = 52
$$

Continuing this process, we find the product AB to be:

$$
\underset{2 \times 2}{\mathbf{AB}} =
\begin{bmatrix} 2 & 5 \\ 4 & 1 \end{bmatrix}
\begin{bmatrix} 4 & 6 \\ 5 & 8 \end{bmatrix}
= \begin{bmatrix} 33 & 52 \\ 21 & 32 \end{bmatrix}
$$


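Let us consider another example:

$$
\underset{2 \times 3}{\mathbf{A}} = \begin{bmatrix} 1 & 3 & 4 \\ 0 & 5 & 8 \end{bmatrix}
\qquad
\underset{3 \times 1}{\mathbf{B}} = \begin{bmatrix} 3 \\ 5 \\ 2 \end{bmatrix}
$$

$$
\underset{2 \times 1}{\mathbf{AB}} =
\begin{bmatrix} 1 & 3 & 4 \\ 0 & 5 & 8 \end{bmatrix}
\begin{bmatrix} 3 \\ 5 \\ 2 \end{bmatrix}
= \begin{bmatrix} 26 \\ 41 \end{bmatrix}
$$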

When obtaining the product AB, we say that A is postmultiplied by B or B is premultiplied
by A. The reason for this precise terminology is that multiplication rules for ordinary algebra
do not apply to matrix algebra. In ordinary algebra, $xy = yx$. In matrix algebra, $\mathbf{AB} \neq \mathbf{BA}$
usually. In fact, even though the product AB may be defined, the product BA may not be
defined at all.

In general, the product AB is defined only when the number of columns in A equals the
number of rows in B so that there will be corresponding terms in the cross products. Thus,
in our previous two examples, we had:

$$
\underset{2 \times 2}{\mathbf{A}} \; \underset{2 \times 2}{\mathbf{B}} = \underset{2 \times 2}{\mathbf{AB}}
\qquad \qquad
\underset{2 \times 3}{\mathbf{A}} \; \underset{3 \times 1}{\mathbf{B}} = \underset{2 \times 1}{\mathbf{AB}}
$$

In each case the two inner dimensions (columns of A, rows of B) are equal, and the two
outer dimensions (rows of A, columns of B) give the dimension of the product.

Note that the dimension of the product AB is given by the number of rows in A and the
number of columns in B. Note also that in the second case the product BA would not be
defined since the number of columns in B is not equal to the number of rows in A:

$$
\underset{3 \times 1}{\mathbf{B}} \; \underset{2 \times 3}{\mathbf{A}}
\qquad (\text{inner dimensions unequal: } 1 \neq 2)
$$

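Here is another example of matrix multiplication:

$$
\mathbf{AB} =
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix}
=
\begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} & a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} \\
a_{21}b_{11} + a_{22}b_{21} + a_{23}b_{31} & a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32}
\end{bmatrix}
$$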


In general, if A has dimension r x c and B has dimension c x s, the product AB is a matrix
of dimension r x s whose element in the $i$th row and $j$th column is:

$$
\sum_{k=1}^{c} a_{ik} b_{kj}
$$

so that:

$$
\underset{r \times s}{\mathbf{AB}} = \left[ \sum_{k=1}^{c} a_{ik} b_{kj} \right]
\qquad i = 1, \ldots, r; \; j = 1, \ldots, s
\tag{5.12}
$$

Thus, in the foregoing example, the element in the first row and second column of the
product AB is:

$$
\sum_{k=1}^{3} a_{1k} b_{k2} = a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32}
$$

as indeed we found by taking the cross products of the elements in the first row of A and
second column of B and summing.
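
The general rule (5.12) can be sketched in a few lines of code. This is an illustrative implementation only, assuming matrices are stored as lists of rows; the name `matmul` is not from the text:

```python
def matmul(A, B):
    """Return the product AB, whose (i, j) element is sum_k a_ik * b_kj (5.12)."""
    r, c = len(A), len(A[0])
    if len(B) != c:
        # AB is defined only when columns of A equal rows of B.
        raise ValueError("number of columns in A must equal number of rows in B")
    s = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(c)) for j in range(s)]
            for i in range(r)]


# First example from the text: both products exist but differ.
A1 = [[2, 5], [4, 1]]
B1 = [[4, 6], [5, 8]]
print(matmul(A1, B1))
print(matmul(B1, A1))

# Second example: AB is 2 x 1, while BA is not defined.
A2 = [[1, 3, 4], [0, 5, 8]]
B2 = [[3], [5], [2]]
print(matmul(A2, B2))
```

Calling `matmul(B2, A2)` raises `ValueError`, since B has one column and A has two rows.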


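Additional Examples

1.

$$
\begin{bmatrix} 4 & 2 \\ 5 & 8 \end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}
= \begin{bmatrix} 4a_1 + 2a_2 \\ 5a_1 + 8a_2 \end{bmatrix}
$$

2.

$$
\begin{bmatrix} 2 & 3 & 5 \end{bmatrix}
\begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix}
= [2^2 + 3^2 + 5^2] = [38]
$$

Here, the product is a 1 x 1 matrix, which is equivalent to a scalar. Thus, the matrix product
here equals the number 38.

3.

$$
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ 1 & X_3 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
= \begin{bmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \beta_0 + \beta_1 X_3 \end{bmatrix}
$$

Regression Examples

A product frequently needed is Y'Y, where Y is the vector of observations on the response
variable as defined in (5.4):

$$
\underset{1 \times 1}{\mathbf{Y}'\mathbf{Y}} =
\begin{bmatrix} Y_1 & Y_2 & \cdots & Y_n \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
= [Y_1^2 + Y_2^2 + \cdots + Y_n^2] = \left[ \sum Y_i^2 \right]
\tag{5.13}
$$

Note that Y'Y is a 1 x 1 matrix, or a scalar. We thus have a compact way of writing a sum
of squared terms: $\mathbf{Y}'\mathbf{Y} = \sum Y_i^2$.

We also will need X'X, which is a 2 x 2 matrix, where X is defined in (5.6):

$$
\underset{2 \times 2}{\mathbf{X}'\mathbf{X}} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
= \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}
\tag{5.14}
$$

and X'Y, which is a 2 x 1 matrix:

$$
\underset{2 \times 1}{\mathbf{X}'\mathbf{Y}} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
= \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}
\tag{5.15}
$$

5.4 Special Types of Matrices

Certain special types of matrices arise regularly in regression analysis. We consider the
most important of these.

Symmetric Matrix

If A = A', A is said to be symmetric. Thus, A below is symmetric:

$$
\underset{3 \times 3}{\mathbf{A}} = \begin{bmatrix} 1 & 4 & 6 \\ 4 & 2 & 5 \\ 6 & 5 & 3 \end{bmatrix}
\qquad
\underset{3 \times 3}{\mathbf{A}'} = \begin{bmatrix} 1 & 4 & 6 \\ 4 & 2 & 5 \\ 6 & 5 & 3 \end{bmatrix}
$$

A symmetric matrix necessarily is square. Symmetric matrices arise typically in regression
analysis when we premultiply a matrix, say, X, by its transpose, X'. The resulting matrix,
X'X, is symmetric, as can readily be seen from (5.14).

Diagonal Matrix

A diagonal matrix is a square matrix whose off-diagonal elements are all zeros, such as:

$$
\underset{3 \times 3}{\mathbf{A}} = \begin{bmatrix} a_1 & 0 & 0 \\ 0 & a_2 & 0 \\ 0 & 0 & a_3 \end{bmatrix}
\qquad
\underset{4 \times 4}{\mathbf{B}} = \begin{bmatrix} 4 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}
$$

We will often not show all zeros for a diagonal matrix, presenting it in the form:

$$
\underset{3 \times 3}{\mathbf{A}} = \begin{bmatrix} a_1 & & 0 \\ & a_2 & \\ 0 & & a_3 \end{bmatrix}
\qquad
\underset{4 \times 4}{\mathbf{B}} = \begin{bmatrix} 4 & & & 0 \\ & 1 & & \\ & & 10 & \\ 0 & & & 5 \end{bmatrix}
$$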


Two important types of diagonal matrices are the identity matrix and the scalar matrix. 


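Identity Matrix. The identity matrix or unit matrix is denoted by I. It is a diagonal matrix
whose elements on the main diagonal are all 1s. Premultiplying or postmultiplying any r x r
matrix A by the r x r identity matrix I leaves A unchanged. For example:

$$
\mathbf{IA} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
= \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
$$

Similarly, we have:

$$
\mathbf{AI} =
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
$$

Note that the identity matrix I therefore corresponds to the number 1 in ordinary algebra,
since we have there that $1 \cdot x = x \cdot 1 = x$.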

In general, we have for any rxr matrix A: 

AI = IA = A (5.16) 


Thus, the identity matrix can be inserted or dropped from a matrix expression whenever it 
is convenient to do so. 


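Scalar Matrix. A scalar matrix is a diagonal matrix whose main-diagonal elements are
the same. Two examples of scalar matrices are:

$$
\begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}
\qquad
\begin{bmatrix} k & 0 & 0 \\ 0 & k & 0 \\ 0 & 0 & k \end{bmatrix}
$$

A scalar matrix can be expressed as kI, where k is the scalar. For instance:

$$
\begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}
= 2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = 2\mathbf{I}
$$

$$
\begin{bmatrix} k & 0 & 0 \\ 0 & k & 0 \\ 0 & 0 & k \end{bmatrix}
= k \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = k\mathbf{I}
$$

Multiplying an r x r matrix A by the r x r scalar matrix kI is equivalent to multiplying
A by the scalar k.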
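
The properties of the identity and scalar matrices can be checked numerically. The sketch below is illustrative only: the helper names `identity`, `scalar_matrix`, `scalar_times` and `matmul` are assumptions, and the matrix A is the 2 x 2 example used earlier for scalar multiplication:

```python
def identity(r):
    """Return the r x r identity matrix I."""
    return [[1 if i == j else 0 for j in range(r)] for i in range(r)]


def scalar_matrix(k, r):
    """Return the r x r scalar matrix kI."""
    return [[k if i == j else 0 for j in range(r)] for i in range(r)]


def scalar_times(k, A):
    """Multiply every element of A by the scalar k."""
    return [[k * a for a in row] for row in A]


def matmul(A, B):
    """Return the matrix product AB (rows of A times columns of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


A = [[2, 7], [9, 3]]
I = identity(2)

print(matmul(I, A) == A, matmul(A, I) == A)                     # AI = IA = A
print(matmul(scalar_matrix(4, 2), A) == scalar_times(4, A))    # (kI)A = kA
```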





5.5 Linear Dependence and Rank of Matrix

Linear Dependence

Consider the following matrix:

$$
\mathbf{A} = \begin{bmatrix} 1 & 2 & 5 & 1 \\ 2 & 2 & 10 & 6 \\ 3 & 4 & 15 & 1 \end{bmatrix}
$$

Let us think now of the columns of this matrix as vectors. Thus, we view A as being made
up of four column vectors. It happens here that the columns are interrelated in a special
manner. Note that the third column vector is a multiple of the first column vector:

$$
\begin{bmatrix} 5 \\ 10 \\ 15 \end{bmatrix} = 5 \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
$$

We say that the columns of A are linearly dependent. They contain redundant information, 
so to speak, since one column can be obtained as a linear combination of the others. 

We define the set of c column vectors $\mathbf{C}_1, \ldots, \mathbf{C}_c$ in an r x c matrix to be linearly
dependent if one vector can be expressed as a linear combination of the others. If no vector
in the set can be so expressed, we define the set of vectors to be linearly independent. A
more general, though equivalent, definition is:

When c scalars $k_1, \ldots, k_c$, not all zero, can be found such that:

$$
k_1 \mathbf{C}_1 + k_2 \mathbf{C}_2 + \cdots + k_c \mathbf{C}_c = \mathbf{0}
\tag{5.20}
$$

where $\mathbf{0}$ denotes the zero column vector, the c column vectors are linearly
dependent. If the only set of scalars for which the equality holds is
$k_1 = 0, \ldots, k_c = 0$, the set of c column vectors is linearly independent.

To illustrate for our example, $k_1 = 5$, $k_2 = 0$, $k_3 = -1$, $k_4 = 0$ leads to:

$$
5 \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
+ 0 \begin{bmatrix} 2 \\ 2 \\ 4 \end{bmatrix}
- 1 \begin{bmatrix} 5 \\ 10 \\ 15 \end{bmatrix}
+ 0 \begin{bmatrix} 1 \\ 6 \\ 1 \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
$$

Hence, the column vectors are linearly dependent. Note that some of the $k_j$ equal zero here.
For linear dependence, it is only required that not all $k_j$ be zero.

Rank of Matrix 

The rank of a matrix is defined to be the maximum number of linearly independent columns
in the matrix. We know that the rank of A in our earlier example cannot be 4, since the four
columns are linearly dependent. We can, however, find three columns (1, 2, and 4) which
are linearly independent. There are no scalars $k_1, k_2, k_4$ such that $k_1 \mathbf{C}_1 + k_2 \mathbf{C}_2 + k_4 \mathbf{C}_4 = \mathbf{0}$
other than $k_1 = k_2 = k_4 = 0$. Thus, the rank of A in our example is 3.

The rank of a matrix is unique and can equivalently be defined as the maximum number 
of linearly independent rows. It follows that the rank of an r x c matrix cannot exceed 
min(r, c), the minimum of the two values r and c. 
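
One way to compute a rank in practice is to reduce the matrix to row echelon form and count the pivots. The text does not prescribe a method, so the sketch below is an assumption on our part: it uses Gaussian elimination with exact `Fraction` arithmetic to avoid rounding problems, and the function name `rank` is illustrative.

```python
from fractions import Fraction


def rank(matrix):
    """Return the rank of a matrix (list of rows) by Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivots = 0                      # number of pivots found so far
    for col in range(n_cols):
        # Look for a nonzero entry in this column at or below the pivot row.
        pivot = next((r for r in range(pivots, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue                # column depends on earlier pivot columns
        rows[pivots], rows[pivot] = rows[pivot], rows[pivots]
        # Eliminate the entries below the pivot.
        for r in range(pivots + 1, n_rows):
            factor = rows[r][col] / rows[pivots][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivots])]
        pivots += 1
        if pivots == n_rows:
            break
    return pivots


A = [[1, 2, 5, 1],
     [2, 2, 10, 6],
     [3, 4, 15, 1]]

print(rank(A))
```

For the matrix A of this section the elimination finds pivots in columns 1, 2, and 4 and none in column 3, which agrees with the discussion above.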


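When a matrix is the product of two matrices, its rank cannot exceed the smaller of the
two ranks for the matrices being multiplied. Thus, if C = AB, the rank of C cannot exceed
min(rank A, rank B).

5.6 Inverse of a Matrix

In ordinary algebra, the inverse of a number is its reciprocal. Thus, the inverse of 6 is $\frac{1}{6}$. A
number multiplied by its inverse always equals 1:

$$
6 \cdot \frac{1}{6} = \frac{1}{6} \cdot 6 = 1
$$

$$
x \cdot \frac{1}{x} = x \cdot x^{-1} = x^{-1} \cdot x = 1
$$

In matrix algebra, the inverse of a matrix A is another matrix, denoted by $\mathbf{A}^{-1}$, such that:

$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}
\tag{5.21}
$$

where I is the identity matrix. Thus, again, the identity matrix I plays the same role as the
number 1 in ordinary algebra. An inverse of a matrix is defined only for square matrices.
Even so, many square matrices do not have inverses. If a square matrix does have an inverse,
the inverse is unique.

Examples

1. The inverse of the matrix:

$$
\underset{2 \times 2}{\mathbf{A}} = \begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix}
$$

is:

$$
\underset{2 \times 2}{\mathbf{A}^{-1}} = \begin{bmatrix} -.1 & .4 \\ .3 & -.2 \end{bmatrix}
$$

since:

$$
\mathbf{A}^{-1}\mathbf{A} =
\begin{bmatrix} -.1 & .4 \\ .3 & -.2 \end{bmatrix}
\begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \mathbf{I}
$$

or:

$$
\mathbf{A}\mathbf{A}^{-1} =
\begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix}
\begin{bmatrix} -.1 & .4 \\ .3 & -.2 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \mathbf{I}
$$

2. The inverse of the matrix:

$$
\underset{3 \times 3}{\mathbf{A}} = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 2 \end{bmatrix}
$$

is:

$$
\underset{3 \times 3}{\mathbf{A}^{-1}} = \begin{bmatrix} \frac{1}{3} & 0 & 0 \\ 0 & \frac{1}{4} & 0 \\ 0 & 0 & \frac{1}{2} \end{bmatrix}
$$

since:

$$
\mathbf{A}^{-1}\mathbf{A} =
\begin{bmatrix} \frac{1}{3} & 0 & 0 \\ 0 & \frac{1}{4} & 0 \\ 0 & 0 & \frac{1}{2} \end{bmatrix}
\begin{bmatrix} 3 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 2 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \mathbf{I}
$$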

Note that the inverse of a diagonal matrix is a diagonal matrix consisting simply of the 
reciprocals of the elements on the diagonal. 

Finding the Inverse 

Up to this point, the inverse of a matrix A has been given, and we have only checked to 
make sure it is the inverse by seeing whether or not A -1 A = I. But how does one find the 
inverse, and when does it exist? 

An inverse of a square r x r matrix exists if the rank of the matrix is r. Such a matrix is
said to be nonsingular or of full rank. An r x r matrix with rank less than r is said to be
singular or not of full rank, and does not have an inverse. The inverse of an r x r matrix of
full rank also has rank r.

Finding the inverse of a matrix can often require a large amount of computing. We shall 
take the approach in this book that the inverse of a 2 x 2 matrix and a 3 x 3 matrix can 
be calculated by hand. For any larger matrix, one ordinarily uses a computer to find the 
inverse, unless the matrix is of a special form such as a diagonal matrix. It can be shown 
that the inverses for 2 x 2 and 3x3 matrices are as follows: 

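1. If:

$$
\underset{2 \times 2}{\mathbf{A}} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}
$$

then:

$$
\underset{2 \times 2}{\mathbf{A}^{-1}} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1}
= \begin{bmatrix} \dfrac{d}{D} & \dfrac{-b}{D} \\[2mm] \dfrac{-c}{D} & \dfrac{a}{D} \end{bmatrix}
\tag{5.22}
$$

where:

$$
D = ad - bc
\tag{5.22a}
$$

D is called the determinant of the matrix A. If A were singular, its determinant would equal
zero and no inverse of A would exist.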

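2. If:

$$
\underset{3 \times 3}{\mathbf{B}} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & k \end{bmatrix}
$$

then:

$$
\underset{3 \times 3}{\mathbf{B}^{-1}} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & k \end{bmatrix}^{-1}
= \begin{bmatrix} A & B & C \\ D & E & F \\ G & H & K \end{bmatrix}
\tag{5.23}
$$

where:

$$
\begin{aligned}
A &= (ek - fh)/Z & B &= -(bk - ch)/Z & C &= (bf - ce)/Z \\
D &= -(dk - fg)/Z & E &= (ak - cg)/Z & F &= -(af - cd)/Z \\
G &= (dh - eg)/Z & H &= -(ah - bg)/Z & K &= (ae - bd)/Z
\end{aligned}
\tag{5.23a}
$$

and:

$$
Z = a(ek - fh) - b(dk - fg) + c(dh - eg)
\tag{5.23b}
$$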

Z is called the determinant of the matrix B. 
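
Formulas (5.22) and (5.23) can be sketched in code as follows. This is an illustration only: the function names are ours, `Fraction` is used so that the results are exact, and the non-diagonal 3 x 3 test matrix `M` is a hypothetical example that does not appear in the text.

```python
from fractions import Fraction


def inverse_2x2(M):
    """Invert a 2 x 2 matrix using (5.22), with D = ad - bc from (5.22a)."""
    (a, b), (c, d) = M
    D = a * d - b * c
    if D == 0:
        raise ValueError("matrix is singular (determinant is zero)")
    D = Fraction(D)
    return [[d / D, -b / D],
            [-c / D, a / D]]


def inverse_3x3(M):
    """Invert a 3 x 3 matrix using (5.23), (5.23a) and (5.23b)."""
    (a, b, c), (d, e, f), (g, h, k) = M
    Z = a * (e * k - f * h) - b * (d * k - f * g) + c * (d * h - e * g)
    if Z == 0:
        raise ValueError("matrix is singular (determinant is zero)")
    Z = Fraction(Z)
    return [[(e * k - f * h) / Z, -(b * k - c * h) / Z, (b * f - c * e) / Z],
            [-(d * k - f * g) / Z, (a * k - c * g) / Z, -(a * f - c * d) / Z],
            [(d * h - e * g) / Z, -(a * h - b * g) / Z, (a * e - b * d) / Z]]


def matmul(A, B):
    """Return the matrix product AB."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)]
            for row in A]


A = [[2, 4], [3, 1]]                       # 2 x 2 example from the text
B = [[3, 0, 0], [0, 4, 0], [0, 0, 2]]      # diagonal example from the text
M = [[1, 2, 3], [0, 1, 4], [5, 6, 0]]      # hypothetical non-diagonal example

print(inverse_2x2(A))
print(inverse_3x3(B))
print(matmul(inverse_3x3(M), M))           # should be the identity matrix
```

Multiplying each computed inverse by the original matrix and comparing with I is the same check the text recommends for inverses obtained by hand.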


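Let us use (5.22) to find the inverse of:

$$
\mathbf{A} = \begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix}
$$

We have:

$$
a = 2 \qquad b = 4 \qquad c = 3 \qquad d = 1
$$

$$
D = ad - bc = 2(1) - 4(3) = -10
$$

Hence:

$$
\mathbf{A}^{-1} =
\begin{bmatrix} \dfrac{1}{-10} & \dfrac{-4}{-10} \\[2mm] \dfrac{-3}{-10} & \dfrac{2}{-10} \end{bmatrix}
= \begin{bmatrix} -.1 & .4 \\ .3 & -.2 \end{bmatrix}
$$

as was given in an earlier example.

When an inverse $\mathbf{A}^{-1}$ has been obtained by hand calculations or from a computer program
for which the accuracy of inverting a matrix is not known, it may be wise to compute
$\mathbf{A}^{-1}\mathbf{A}$ to check whether the product equals the identity matrix, allowing for minor rounding
departures from 0 and 1.

Regression Example

The principal inverse matrix encountered in regression analysis is the inverse of the matrix
X'X in (5.14):

$$
\underset{2 \times 2}{\mathbf{X}'\mathbf{X}} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}
$$

Using rule (5.22), we have:

$$
a = n \qquad b = \sum X_i \qquad c = \sum X_i \qquad d = \sum X_i^2
$$

so that:

$$
D = n \sum X_i^2 - \left( \sum X_i \right)\left( \sum X_i \right)
= n \left[ \sum X_i^2 - \frac{\left( \sum X_i \right)^2}{n} \right]
= n \sum (X_i - \bar{X})^2
$$

Hence:

$$
\underset{2 \times 2}{(\mathbf{X}'\mathbf{X})^{-1}} =
\begin{bmatrix}
\dfrac{\sum X_i^2}{n \sum (X_i - \bar{X})^2} & \dfrac{-\sum X_i}{n \sum (X_i - \bar{X})^2} \\[3mm]
\dfrac{-\sum X_i}{n \sum (X_i - \bar{X})^2} & \dfrac{n}{n \sum (X_i - \bar{X})^2}
\end{bmatrix}
\tag{5.24}
$$

Since $\sum X_i = n\bar{X}$ and $\sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$, we can simplify (5.24):

$$
\underset{2 \times 2}{(\mathbf{X}'\mathbf{X})^{-1}} =
\begin{bmatrix}
\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum (X_i - \bar{X})^2} & \dfrac{-\bar{X}}{\sum (X_i - \bar{X})^2} \\[3mm]
\dfrac{-\bar{X}}{\sum (X_i - \bar{X})^2} & \dfrac{1}{\sum (X_i - \bar{X})^2}
\end{bmatrix}
\tag{5.24a}
$$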
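
The simplified form of the inverse of X'X can be checked against X'X itself for a small data set. The sketch below is illustrative: the function names and the predictor values 1, 2, 3 are hypothetical, and the inputs are assumed to be integers so that `Fraction` arithmetic stays exact.

```python
from fractions import Fraction


def xtx(x_values):
    """Return X'X = [[n, sum X], [sum X, sum X^2]] as in (5.14)."""
    n = len(x_values)
    return [[n, sum(x_values)],
            [sum(x_values), sum(x * x for x in x_values)]]


def xtx_inverse(x_values):
    """Return the inverse of X'X in the simplified, mean-centered form."""
    n = len(x_values)
    x_bar = Fraction(sum(x_values), n)              # sample mean of X
    sxx = sum((x - x_bar) ** 2 for x in x_values)   # sum of (X_i - X_bar)^2
    if sxx == 0:
        raise ValueError("all X values are equal, so X'X is singular")
    return [[Fraction(1, n) + x_bar ** 2 / sxx, -x_bar / sxx],
            [-x_bar / sxx, 1 / sxx]]


def matmul(A, B):
    """Return the matrix product AB."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


x_values = [1, 2, 3]                     # hypothetical predictor observations
print(xtx(x_values))
print(xtx_inverse(x_values))
print(matmul(xtx_inverse(x_values), xtx(x_values)))   # should be the identity
```

If every X value is the same, the sum of squared deviations is zero and X'X has no inverse, which the sketch reports as an error.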

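Uses of Inverse Matrix

In ordinary algebra, we solve an equation of the type:

$$
5y = 20
$$

by multiplying both sides of the equation by the inverse of 5, namely:

$$
\frac{1}{5}(5y) = \frac{1}{5}(20)
$$

and we obtain the solution:

$$
y = \frac{1}{5}(20) = 4
$$

In matrix algebra, if we have an equation:

$$
\mathbf{AY} = \mathbf{C}
$$

we correspondingly premultiply both sides by $\mathbf{A}^{-1}$, assuming A has an inverse:

$$
\mathbf{A}^{-1}\mathbf{AY} = \mathbf{A}^{-1}\mathbf{C}
$$

Since $\mathbf{A}^{-1}\mathbf{AY} = \mathbf{IY} = \mathbf{Y}$, we obtain the solution:

$$
\mathbf{Y} = \mathbf{A}^{-1}\mathbf{C}
$$

To illustrate this use, suppose we have two simultaneous equations:

$$
\begin{aligned}
2y_1 + 4y_2 &= 20 \\
3y_1 + y_2 &= 10
\end{aligned}
$$

which can be written as follows in matrix notation:

$$
\begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
= \begin{bmatrix} 20 \\ 10 \end{bmatrix}
$$

The solution of these equations then is:

$$
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
= \begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix}^{-1}
\begin{bmatrix} 20 \\ 10 \end{bmatrix}
$$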
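
The same system can be solved numerically by forming the inverse and premultiplying the right-hand side. The following is a minimal sketch under stated assumptions: `inverse_2x2` applies rule (5.22), `matvec` is an illustrative helper, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def inverse_2x2(M):
    """Invert a 2 x 2 matrix with the determinant rule (5.22)."""
    (a, b), (c, d) = M
    D = a * d - b * c
    if D == 0:
        raise ValueError("matrix is singular, so the system has no unique solution")
    D = Fraction(D)
    return [[d / D, -b / D], [-c / D, a / D]]


def matvec(M, v):
    """Multiply the matrix M by the column vector v (given as a flat list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


A = [[2, 4], [3, 1]]     # coefficients of the two simultaneous equations
C = [20, 10]             # right-hand sides

Y = matvec(inverse_2x2(A), C)
print(Y)
```

Substituting the result back into both equations is a useful check on the arithmetic.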













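Earlier we found the required inverse, so we obtain:

$$
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
= \begin{bmatrix} -.1 & .4 \\ .3 & -.2 \end{bmatrix}
\begin{bmatrix} 20 \\ 10 \end{bmatrix}
= \begin{bmatrix} 2 \\ 4 \end{bmatrix}
$$

Hence, $y_1 = 2$ and $y_2 = 4$ satisfy these two equations.

5.7 Some Basic Results for Matrices

We list here, without proof, some basic results for matrices which we will utilize in later
work.

$$
\begin{aligned}
\mathbf{A} + \mathbf{B} &= \mathbf{B} + \mathbf{A} && (5.25) \\
(\mathbf{A} + \mathbf{B}) + \mathbf{C} &= \mathbf{A} + (\mathbf{B} + \mathbf{C}) && (5.26) \\
(\mathbf{AB})\mathbf{C} &= \mathbf{A}(\mathbf{BC}) && (5.27) \\
\mathbf{C}(\mathbf{A} + \mathbf{B}) &= \mathbf{CA} + \mathbf{CB} && (5.28) \\
k(\mathbf{A} + \mathbf{B}) &= k\mathbf{A} + k\mathbf{B} && (5.29) \\
(\mathbf{A}')' &= \mathbf{A} && (5.30) \\
(\mathbf{A} + \mathbf{B})' &= \mathbf{A}' + \mathbf{B}' && (5.31) \\
(\mathbf{AB})' &= \mathbf{B}'\mathbf{A}' && (5.32) \\
(\mathbf{ABC})' &= \mathbf{C}'\mathbf{B}'\mathbf{A}' && (5.33) \\
(\mathbf{AB})^{-1} &= \mathbf{B}^{-1}\mathbf{A}^{-1} && (5.34) \\
(\mathbf{ABC})^{-1} &= \mathbf{C}^{-1}\mathbf{B}^{-1}\mathbf{A}^{-1} && (5.35) \\
(\mathbf{A}^{-1})^{-1} &= \mathbf{A} && (5.36) \\
(\mathbf{A}')^{-1} &= (\mathbf{A}^{-1})' && (5.37)
\end{aligned}
$$

5.8 Random Vectors and Matrices

A random vector or a random matrix contains elements that are random variables. Thus,
the observations vector Y in (5.4) is a random vector since the $Y_i$ elements are random
variables.

Expectation of Random Vector or Matrix

Suppose we have n = 3 observations in the observations vector Y:

$$
\underset{3 \times 1}{\mathbf{Y}} = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
$$

The expected value of Y is a vector, denoted by E{Y}, that is defined as follows:

$$
\underset{3 \times 1}{E\{\mathbf{Y}\}} = \begin{bmatrix} E\{Y_1\} \\ E\{Y_2\} \\ E\{Y_3\} \end{bmatrix}
$$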


Thus, the expected value of a random vector is a vector whose elements are the expected
values of the random variables that are the elements of the random vector. Similarly, the
expectation of a random matrix is a matrix whose elements are the expected values of the
corresponding random variables in the original matrix. We encountered a vector of expected 
values earlier in (5.9). 

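In general, for a random vector Y the expectation is:

$$
\underset{n \times 1}{E\{\mathbf{Y}\}} = [E\{Y_i\}] \qquad i = 1, \ldots, n
\tag{5.38}
$$

and for a random matrix Y with dimension n x p, the expectation is:

$$
\underset{n \times p}{E\{\mathbf{Y}\}} = [E\{Y_{ij}\}] \qquad i = 1, \ldots, n; \; j = 1, \ldots, p
\tag{5.39}
$$

Regression Example

Suppose the number of cases in a regression application is n = 3. The three error terms
$\varepsilon_1, \varepsilon_2, \varepsilon_3$ each have expectation zero. For the error terms vector:

$$
\underset{3 \times 1}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix}
$$

we have:

$$
\underset{3 \times 1}{E\{\boldsymbol{\varepsilon}\}} = \underset{3 \times 1}{\mathbf{0}}
$$

since:

$$
\begin{bmatrix} E\{\varepsilon_1\} \\ E\{\varepsilon_2\} \\ E\{\varepsilon_3\} \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
$$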


Variance-Covariance Matrix of Random Vector 

Consider again the random vector Y consisting of three observations $Y_1, Y_2, Y_3$. The variances
of the three random variables, $\sigma^2\{Y_i\}$, and the covariances between any two of the random
variables, $\sigma\{Y_i, Y_j\}$, are assembled in the variance-covariance matrix of Y, denoted by
$\sigma^2\{\mathbf{Y}\}$, in the following form:

$$
\sigma^2\{\mathbf{Y}\} = \begin{bmatrix}
\sigma^2\{Y_1\} & \sigma\{Y_1, Y_2\} & \sigma\{Y_1, Y_3\} \\
\sigma\{Y_2, Y_1\} & \sigma^2\{Y_2\} & \sigma\{Y_2, Y_3\} \\
\sigma\{Y_3, Y_1\} & \sigma\{Y_3, Y_2\} & \sigma^2\{Y_3\}
\end{bmatrix}
\tag{5.40}
$$

Note that the variances are on the main diagonal, and the covariance $\sigma\{Y_i, Y_j\}$ is found
in the $i$th row and $j$th column of the matrix. Thus, $\sigma\{Y_2, Y_1\}$ is found in the second row,
first column, and $\sigma\{Y_1, Y_2\}$ is found in the first row, second column. Remember, of course,
that $\sigma\{Y_2, Y_1\} = \sigma\{Y_1, Y_2\}$. Since $\sigma\{Y_i, Y_j\} = \sigma\{Y_j, Y_i\}$ for all $i \neq j$, $\sigma^2\{\mathbf{Y}\}$ is a symmetric
matrix.

It follows readily that:

$$
\sigma^2\{\mathbf{Y}\} = E\{[\mathbf{Y} - E\{\mathbf{Y}\}][\mathbf{Y} - E\{\mathbf{Y}\}]'\}
\tag{5.41}
$$




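For our illustration, we have:

$$
\sigma^2\{\mathbf{Y}\} = E\left\{
\begin{bmatrix} Y_1 - E\{Y_1\} \\ Y_2 - E\{Y_2\} \\ Y_3 - E\{Y_3\} \end{bmatrix}
\begin{bmatrix} Y_1 - E\{Y_1\} & Y_2 - E\{Y_2\} & Y_3 - E\{Y_3\} \end{bmatrix}
\right\}
$$

Multiplying the two matrices and then taking expectations, we obtain:

$$
\begin{array}{lll}
\text{Location in Product} & \text{Term} & \text{Expected Value} \\ \hline
\text{Row 1, column 1} & (Y_1 - E\{Y_1\})^2 & \sigma^2\{Y_1\} \\
\text{Row 1, column 2} & (Y_1 - E\{Y_1\})(Y_2 - E\{Y_2\}) & \sigma\{Y_1, Y_2\} \\
\text{Row 1, column 3} & (Y_1 - E\{Y_1\})(Y_3 - E\{Y_3\}) & \sigma\{Y_1, Y_3\} \\
\text{Row 2, column 1} & (Y_2 - E\{Y_2\})(Y_1 - E\{Y_1\}) & \sigma\{Y_2, Y_1\} \\
\text{etc.} & \text{etc.} & \text{etc.}
\end{array}
$$

This, of course, leads to the variance-covariance matrix in (5.40). Remember the definitions
of variance and covariance in (A.15) and (A.21), respectively, when taking expectations.

To generalize, the variance-covariance matrix for an n x 1 random vector Y is:

$$
\underset{n \times n}{\sigma^2\{\mathbf{Y}\}} = \begin{bmatrix}
\sigma^2\{Y_1\} & \sigma\{Y_1, Y_2\} & \cdots & \sigma\{Y_1, Y_n\} \\
\sigma\{Y_2, Y_1\} & \sigma^2\{Y_2\} & \cdots & \sigma\{Y_2, Y_n\} \\
\vdots & \vdots & & \vdots \\
\sigma\{Y_n, Y_1\} & \sigma\{Y_n, Y_2\} & \cdots & \sigma^2\{Y_n\}
\end{bmatrix}
\tag{5.42}
$$

Note again that $\sigma^2\{\mathbf{Y}\}$ is a symmetric matrix.

Regression Example

Let us return to the example based on n = 3 cases. Suppose that the three error terms have
constant variance, $\sigma^2\{\varepsilon_i\} = \sigma^2$, and are uncorrelated so that $\sigma\{\varepsilon_i, \varepsilon_j\} = 0$ for $i \neq j$. The
variance-covariance matrix for the random vector $\boldsymbol{\varepsilon}$ of the previous example is therefore as
follows:

$$
\underset{3 \times 3}{\sigma^2\{\boldsymbol{\varepsilon}\}} = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}
$$

Note that all variances are $\sigma^2$ and all covariances are zero. Note also that this
variance-covariance matrix is a scalar matrix, with the common variance $\sigma^2$ the scalar. Hence, we
can express the variance-covariance matrix in the following simple fashion:

$$
\underset{3 \times 3}{\sigma^2\{\boldsymbol{\varepsilon}\}} = \sigma^2 \underset{3 \times 3}{\mathbf{I}}
$$

since:

$$
\sigma^2 \mathbf{I} = \sigma^2 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}
$$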


Some Basic Results 

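Frequently, we shall encounter a random vector $\mathbf{W}$ that is obtained by premultiplying the random vector $\mathbf{Y}$ by a constant matrix $\mathbf{A}$ (a matrix whose elements are fixed):

$$\mathbf{W} = \mathbf{A}\mathbf{Y} \tag{5.43}$$

Some basic results for this case are:

$$\mathbf{E}\{\mathbf{A}\} = \mathbf{A} \tag{5.44}$$

$$\mathbf{E}\{\mathbf{W}\} = \mathbf{E}\{\mathbf{A}\mathbf{Y}\} = \mathbf{A}\mathbf{E}\{\mathbf{Y}\} \tag{5.45}$$

$$\boldsymbol{\sigma}^2\{\mathbf{W}\} = \boldsymbol{\sigma}^2\{\mathbf{A}\mathbf{Y}\} = \mathbf{A}\boldsymbol{\sigma}^2\{\mathbf{Y}\}\mathbf{A}' \tag{5.46}$$

where $\boldsymbol{\sigma}^2\{\mathbf{Y}\}$ is the variance-covariance matrix of $\mathbf{Y}$.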


Example 


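As a simple illustration of the use of these results, consider:

$$\underset{2 \times 1}{\begin{bmatrix} W_1 \\ W_2 \end{bmatrix}} = \underset{2 \times 2}{\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}} \underset{2 \times 1}{\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}} = \begin{bmatrix} Y_1 - Y_2 \\ Y_1 + Y_2 \end{bmatrix}$$

that is, $\mathbf{W} = \mathbf{A}\mathbf{Y}$. We then have by (5.45):

$$\underset{2 \times 1}{\mathbf{E}\{\mathbf{W}\}} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} E\{Y_1\} \\ E\{Y_2\} \end{bmatrix} = \begin{bmatrix} E\{Y_1\} - E\{Y_2\} \\ E\{Y_1\} + E\{Y_2\} \end{bmatrix}$$

and by (5.46):

$$\underset{2 \times 2}{\boldsymbol{\sigma}^2\{\mathbf{W}\}} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \sigma^2\{Y_1\} & \sigma\{Y_1, Y_2\} \\ \sigma\{Y_2, Y_1\} & \sigma^2\{Y_2\} \end{bmatrix} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$$

$$= \begin{bmatrix} \sigma^2\{Y_1\} + \sigma^2\{Y_2\} - 2\sigma\{Y_1, Y_2\} & \sigma^2\{Y_1\} - \sigma^2\{Y_2\} \\ \sigma^2\{Y_1\} - \sigma^2\{Y_2\} & \sigma^2\{Y_1\} + \sigma^2\{Y_2\} + 2\sigma\{Y_1, Y_2\} \end{bmatrix}$$

Thus:

$$\sigma^2\{W_1\} = \sigma^2\{Y_1 - Y_2\} = \sigma^2\{Y_1\} + \sigma^2\{Y_2\} - 2\sigma\{Y_1, Y_2\}$$

$$\sigma^2\{W_2\} = \sigma^2\{Y_1 + Y_2\} = \sigma^2\{Y_1\} + \sigma^2\{Y_2\} + 2\sigma\{Y_1, Y_2\}$$

$$\sigma\{W_1, W_2\} = \sigma\{Y_1 - Y_2, Y_1 + Y_2\} = \sigma^2\{Y_1\} - \sigma^2\{Y_2\}$$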
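
The example can be checked numerically. The following is a minimal sketch in Python with NumPy (neither is used in the text); the entries of the variance-covariance matrix of Y are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical variance-covariance matrix of Y = (Y1, Y2)'; illustrative values only.
sigma_Y = np.array([[4.0, 1.5],
                    [1.5, 9.0]])

# Constant matrix A from the example: W1 = Y1 - Y2, W2 = Y1 + Y2.
A = np.array([[1.0, -1.0],
              [1.0,  1.0]])

# Result (5.46): sigma^2{W} = A sigma^2{Y} A'
sigma_W = A @ sigma_Y @ A.T

# Element-by-element expressions derived in the example.
var_y1, var_y2, cov_y12 = sigma_Y[0, 0], sigma_Y[1, 1], sigma_Y[0, 1]
expected = np.array([
    [var_y1 + var_y2 - 2 * cov_y12, var_y1 - var_y2],
    [var_y1 - var_y2, var_y1 + var_y2 + 2 * cov_y12],
])

print(sigma_W)
print(np.allclose(sigma_W, expected))
```

With these assumed inputs the matrix product and the element-by-element formulas agree, which is all the sketch is meant to show.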


Multivariate Normal Distribution 

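Density Function. The density function for the multivariate normal distribution is best given in matrix form. We first need to define some vectors and matrices. The observations vector $\mathbf{Y}$ containing an observation on each of the $p$ $Y$ variables is defined as usual:

$$\underset{p \times 1}{\mathbf{Y}} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_p \end{bmatrix} \tag{5.47}$$

The mean vector $\mathbf{E}\{\mathbf{Y}\}$, denoted by $\boldsymbol{\mu}$, contains the expected values for each of the $p$ $Y$ variables:

$$\underset{p \times 1}{\boldsymbol{\mu}} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} \tag{5.48}$$

Finally, the variance-covariance matrix $\boldsymbol{\sigma}^2\{\mathbf{Y}\}$ is denoted by $\boldsymbol{\Sigma}$ and contains as always the variances and covariances of the $p$ $Y$ variables:

$$\underset{p \times p}{\boldsymbol{\Sigma}} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{bmatrix} \tag{5.49}$$

Here, $\sigma_1^2$ denotes the variance of $Y_1$, $\sigma_{12}$ denotes the covariance of $Y_1$ and $Y_2$, and the like.

The density function of the multivariate normal distribution can now be stated as follows:

$$f(\mathbf{Y}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{Y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{Y} - \boldsymbol{\mu}) \right] \tag{5.50}$$

Here, $|\boldsymbol{\Sigma}|$ is the determinant of the variance-covariance matrix $\boldsymbol{\Sigma}$. When there are $p = 2$ variables, the multivariate normal density function (5.50) simplifies to the bivariate normal density function (2.74).

The multivariate normal density function has properties that correspond to the ones described for the bivariate normal distribution. For instance, if $Y_1, \ldots, Y_p$ are jointly normally distributed (i.e., they follow the multivariate normal distribution), the marginal probability distribution of each variable $Y_k$ is normal, with mean $\mu_k$ and standard deviation $\sigma_k$.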
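
As an illustration of the density in (5.50), the following is a minimal sketch in Python with NumPy. The function name `mvn_density` and the numerical values of the mean vector and variance-covariance matrix are assumptions introduced here, not part of the text.

```python
import numpy as np

def mvn_density(y, mu, sigma):
    """Evaluate the multivariate normal density (5.50) at the point y."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = y.shape[0]
    dev = y - mu
    # Quadratic form (Y - mu)' Sigma^{-1} (Y - mu), computed by solving a
    # linear system rather than forming the inverse explicitly.
    quad = dev @ np.linalg.solve(sigma, dev)
    norm_const = (2.0 * np.pi) ** (p / 2.0) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * quad) / norm_const)

# Hypothetical bivariate case (p = 2); the numbers are illustrative only.
mu = [1.0, 2.0]
sigma = [[2.0, 0.6],
         [0.6, 1.0]]
density_at_mean = mvn_density(mu, mu, sigma)

# With p = 1 the formula reduces to the univariate normal density.
univariate_at_zero = mvn_density([0.0], [0.0], [[1.0]])
print(density_at_mean, univariate_at_zero)
```

At the mean vector the exponent is zero, so the density equals the reciprocal of the normalizing constant; the one-variable call gives a quick check against the familiar standard normal density.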


Simple Linear Regression Model in Matrix Terms 


We are now ready to develop simple linear regression in matrix terms. Remember again that 
we will not present any new results, but shall only state in matrix terms the results obtained 
earlier. We begin with the normal error regression model (2.1): 

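$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad i = 1, \ldots, n \tag{5.51}$$

This implies:

$$\begin{aligned} Y_1 &= \beta_0 + \beta_1 X_1 + \varepsilon_1 \\ Y_2 &= \beta_0 + \beta_1 X_2 + \varepsilon_2 \\ &\;\;\vdots \\ Y_n &= \beta_0 + \beta_1 X_n + \varepsilon_n \end{aligned} \tag{5.51a}$$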


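We defined earlier the observations vector $\mathbf{Y}$ in (5.4), the $\mathbf{X}$ matrix in (5.6), and the $\boldsymbol{\varepsilon}$ vector in (5.10). Let us repeat these definitions and also define the $\boldsymbol{\beta}$ vector of the regression coefficients:

$$\underset{n \times 1}{\mathbf{Y}} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \qquad \underset{n \times 2}{\mathbf{X}} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \qquad \underset{2 \times 1}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \qquad \underset{n \times 1}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \tag{5.52}$$

Now we can write (5.51a) in matrix terms compactly as follows:

$$\underset{n \times 1}{\mathbf{Y}} = \underset{n \times 2}{\mathbf{X}} \, \underset{2 \times 1}{\boldsymbol{\beta}} + \underset{n \times 1}{\boldsymbol{\varepsilon}} \tag{5.53}$$

since:

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \vdots \\ \beta_0 + \beta_1 X_n \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_1 + \varepsilon_1 \\ \beta_0 + \beta_1 X_2 + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 X_n + \varepsilon_n \end{bmatrix}$$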


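Note that $\mathbf{X}\boldsymbol{\beta}$ is the vector of the expected values of the $Y_i$ observations since $E\{Y_i\} = \beta_0 + \beta_1 X_i$; hence:

$$\underset{n \times 1}{\mathbf{E}\{\mathbf{Y}\}} = \underset{n \times 1}{\mathbf{X}\boldsymbol{\beta}} \tag{5.54}$$

where $\mathbf{E}\{\mathbf{Y}\}$ is defined in (5.9).

The column of 1s in the $\mathbf{X}$ matrix may be viewed as consisting of the constant $X_0 \equiv 1$ in the alternative regression model (1.5):

$$Y_i = \beta_0 X_0 + \beta_1 X_i + \varepsilon_i \qquad \text{where } X_0 \equiv 1$$

Thus, the $\mathbf{X}$ matrix may be considered to contain a column vector consisting of 1s and another column vector consisting of the predictor variable observations $X_i$.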

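With respect to the error terms, regression model (2.1) assumes that $E\{\varepsilon_i\} = 0$, $\sigma^2\{\varepsilon_i\} = \sigma^2$, and that the $\varepsilon_i$ are independent normal random variables. The condition $E\{\varepsilon_i\} = 0$ in matrix terms is:

$$\underset{n \times 1}{\mathbf{E}\{\boldsymbol{\varepsilon}\}} = \underset{n \times 1}{\mathbf{0}} \tag{5.55}$$

since (5.55) states:

$$\begin{bmatrix} E\{\varepsilon_1\} \\ E\{\varepsilon_2\} \\ \vdots \\ E\{\varepsilon_n\} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$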


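The condition that the error terms have constant variance $\sigma^2$ and that all covariances $\sigma\{\varepsilon_i, \varepsilon_j\}$ for $i \neq j$ are zero (since the $\varepsilon_i$ are independent) is expressed in matrix terms through the variance-covariance matrix of the error terms:

$$\underset{n \times n}{\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\}} = \begin{bmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{bmatrix} \tag{5.56}$$

Since this is a scalar matrix, we know from the earlier example that it can be expressed in the following simple fashion:

$$\underset{n \times n}{\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\}} = \sigma^2 \underset{n \times n}{\mathbf{I}} \tag{5.56a}$$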


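Thus, the normal error regression model (2.1) in matrix terms is:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{5.57}$$

where:

$\boldsymbol{\varepsilon}$ is a vector of independent normal random variables with $\mathbf{E}\{\boldsymbol{\varepsilon}\} = \mathbf{0}$ and $\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\} = \sigma^2\mathbf{I}$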
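
To make the matrix form of the model concrete, here is a minimal sketch in Python with NumPy that builds the X matrix and generates responses according to (5.57). The parameter values, predictor levels, error standard deviation, and random seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Hypothetical parameter values and predictor levels (n = 5 cases).
beta = np.array([10.0, 2.0])              # (beta_0, beta_1)'
sigma = 1.5                               # error standard deviation
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor observations X_i

# X matrix: a column of 1s and a column of the predictor observations.
X = np.column_stack([np.ones_like(x), x])

# E{Y} = X beta, as in (5.54).
mean_Y = X @ beta

# Independent normal errors with E{eps} = 0 and sigma^2{eps} = sigma^2 I.
eps = rng.normal(loc=0.0, scale=sigma, size=x.shape[0])

# Y = X beta + eps, as in (5.57).
Y = mean_Y + eps
print(X.shape, mean_Y)
```

Each row of X pairs the constant 1 with one predictor observation, so the product of X and the parameter vector reproduces the mean responses one case at a time.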


5.10 Least Squares Estimation of Regression Parameters 


Normal Equations 

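The normal equations (1.9):

$$\begin{aligned} n b_0 + b_1 \sum X_i &= \sum Y_i \\ b_0 \sum X_i + b_1 \sum X_i^2 &= \sum X_i Y_i \end{aligned} \tag{5.58}$$

in matrix terms are:

$$\underset{2 \times 2}{\mathbf{X}'\mathbf{X}} \, \underset{2 \times 1}{\mathbf{b}} = \underset{2 \times 1}{\mathbf{X}'\mathbf{Y}} \tag{5.59}$$

where $\mathbf{b}$ is the vector of the least squares regression coefficients:

$$\underset{2 \times 1}{\mathbf{b}} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} \tag{5.59a}$$

To see this, recall that we obtained $\mathbf{X}'\mathbf{X}$ in (5.14) and $\mathbf{X}'\mathbf{Y}$ in (5.15). Equation (5.59) thus states:

$$\begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}$$

or:

$$\begin{bmatrix} n b_0 + b_1 \sum X_i \\ b_0 \sum X_i + b_1 \sum X_i^2 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}$$

These are precisely the normal equations in (5.58).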



Estimated Regression Coefficients 

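To obtain the estimated regression coefficients from the normal equations (5.59) by matrix methods, we premultiply both sides by the inverse of $\mathbf{X}'\mathbf{X}$ (we assume this exists):

$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$

We then find, since $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{I}$ and $\mathbf{I}\mathbf{b} = \mathbf{b}$:

$$\underset{2 \times 1}{\mathbf{b}} = \underset{2 \times 2}{(\mathbf{X}'\mathbf{X})^{-1}} \, \underset{2 \times 1}{\mathbf{X}'\mathbf{Y}} \tag{5.60}$$

The estimators $b_0$ and $b_1$ in $\mathbf{b}$ are the same as those given earlier in (1.10a) and (1.10b). We shall demonstrate this by an example.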


Example 


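We shall use matrix methods to obtain the estimated regression coefficients for the Toluca Company example. The data on the $Y$ and $X$ variables were given in Table 1.1. Using these data, we define the $\mathbf{Y}$ observations vector and the $\mathbf{X}$ matrix as follows:

$$\text{(5.61a)} \quad \mathbf{Y} = \begin{bmatrix} 399 \\ 121 \\ \vdots \\ 323 \end{bmatrix} \qquad \text{(5.61b)} \quad \mathbf{X} = \begin{bmatrix} 1 & 80 \\ 1 & 30 \\ \vdots & \vdots \\ 1 & 70 \end{bmatrix} \tag{5.61}$$

We now require the following matrix products:

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 80 & 30 & \cdots & 70 \end{bmatrix} \begin{bmatrix} 1 & 80 \\ 1 & 30 \\ \vdots & \vdots \\ 1 & 70 \end{bmatrix} = \begin{bmatrix} 25 & 1{,}750 \\ 1{,}750 & 142{,}300 \end{bmatrix} \tag{5.62}$$

$$\mathbf{X}'\mathbf{Y} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 80 & 30 & \cdots & 70 \end{bmatrix} \begin{bmatrix} 399 \\ 121 \\ \vdots \\ 323 \end{bmatrix} = \begin{bmatrix} 7{,}807 \\ 617{,}180 \end{bmatrix} \tag{5.63}$$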



Using (5.22), we find the inverse of $\mathbf{X}'\mathbf{X}$:

$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix} .287475 & -.003535 \\ -.003535 & .00005051 \end{bmatrix} \tag{5.64}$$

In subsequent matrix calculations utilizing this inverse matrix and other matrix results, we shall actually utilize more digits for the matrix elements than are shown.

Finally, we employ (5.60) to obtain:

$$\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \begin{bmatrix} .287475 & -.003535 \\ -.003535 & .00005051 \end{bmatrix} \begin{bmatrix} 7{,}807 \\ 617{,}180 \end{bmatrix} = \begin{bmatrix} 62.37 \\ 3.5702 \end{bmatrix} \tag{5.65}$$

or $b_0 = 62.37$ and $b_1 = 3.5702$. These results agree with the ones in Chapter 1. Any differences would have been due to rounding effects.
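
The calculation in (5.64) and (5.65) can be reproduced from the matrix products reported in (5.62) and (5.63). The following minimal sketch uses Python with NumPy, which is an assumption of this illustration rather than the software used in the text; only the two products given above are taken as input.

```python
import numpy as np

# Matrix products reported in (5.62) and (5.63) for the Toluca Company example.
XtX = np.array([[25.0, 1750.0],
                [1750.0, 142300.0]])
XtY = np.array([7807.0, 617180.0])

# Inverse of X'X, as in (5.64).
XtX_inv = np.linalg.inv(XtX)

# Least squares estimates, as in (5.60). Solving the normal equations directly
# avoids forming the inverse and is usually preferable numerically.
b = np.linalg.solve(XtX, XtY)

print(np.round(XtX_inv, 8))
print(np.round(b, 4))
```

Because the full-precision inverse is used, the estimates agree with (5.65) to the digits shown there.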


Comments 

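1. To derive the normal equations by the method of least squares, we minimize the quantity:

$$Q = \sum [Y_i - (\beta_0 + \beta_1 X_i)]^2$$

In matrix notation:

$$Q = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \tag{5.66}$$

Expanding, we obtain:

$$Q = \mathbf{Y}'\mathbf{Y} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} - \mathbf{Y}'\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$

since $(\mathbf{X}\boldsymbol{\beta})' = \boldsymbol{\beta}'\mathbf{X}'$ by (5.32). Note now that $\mathbf{Y}'\mathbf{X}\boldsymbol{\beta}$ is $1 \times 1$, hence is equal to its transpose, which according to (5.33) is $\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y}$. Thus, we find:

$$Q = \mathbf{Y}'\mathbf{Y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \tag{5.67}$$

To find the value of $\boldsymbol{\beta}$ that minimizes $Q$, we differentiate with respect to $\beta_0$ and $\beta_1$. Let:

$$\frac{\partial}{\partial \boldsymbol{\beta}}(Q) = \begin{bmatrix} \dfrac{\partial Q}{\partial \beta_0} \\[2ex] \dfrac{\partial Q}{\partial \beta_1} \end{bmatrix} \tag{5.68}$$

Then it follows that:

$$\frac{\partial}{\partial \boldsymbol{\beta}}(Q) = -2\mathbf{X}'\mathbf{Y} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \tag{5.69}$$

Equating to the zero vector, dividing by 2, and substituting $\mathbf{b}$ for $\boldsymbol{\beta}$ gives the matrix form of the least squares normal equations in (5.59).

2. A comparison of the normal equations and $\mathbf{X}'\mathbf{X}$ shows that whenever the columns of $\mathbf{X}'\mathbf{X}$ are linearly dependent, the normal equations will be linearly dependent also. No unique solutions can then be obtained for $b_0$ and $b_1$. Fortunately, in most regression applications, the columns of $\mathbf{X}'\mathbf{X}$ are linearly independent, leading to unique solutions for $b_0$ and $b_1$. ■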





5.11 Fitted Values and Residuals 


Fitted Values 

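Let the vector of the fitted values $\hat{Y}_i$ be denoted by $\hat{\mathbf{Y}}$:

$$\underset{n \times 1}{\hat{\mathbf{Y}}} = \begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{bmatrix} \tag{5.70}$$

In matrix notation, we then have:

$$\underset{n \times 1}{\hat{\mathbf{Y}}} = \underset{n \times 2}{\mathbf{X}} \, \underset{2 \times 1}{\mathbf{b}} \tag{5.71}$$

because:

$$\begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} b_0 + b_1 X_1 \\ b_0 + b_1 X_2 \\ \vdots \\ b_0 + b_1 X_n \end{bmatrix}$$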


Example 


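For the Toluca Company example, we obtain the vector of fitted values using the matrices in (5.61b) and (5.65):

$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} = \begin{bmatrix} 1 & 80 \\ 1 & 30 \\ \vdots & \vdots \\ 1 & 70 \end{bmatrix} \begin{bmatrix} 62.37 \\ 3.5702 \end{bmatrix} = \begin{bmatrix} 347.98 \\ 169.47 \\ \vdots \\ 312.28 \end{bmatrix} \tag{5.72}$$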


The fitted values are the same, of course, as in Table 1.2. 

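Hat Matrix. We can express the matrix result for $\hat{\mathbf{Y}}$ in (5.71) as follows by using the expression for $\mathbf{b}$ in (5.60):

$$\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$

or, equivalently:

$$\underset{n \times 1}{\hat{\mathbf{Y}}} = \underset{n \times n}{\mathbf{H}} \, \underset{n \times 1}{\mathbf{Y}} \tag{5.73}$$

where:

$$\underset{n \times n}{\mathbf{H}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \tag{5.73a}$$

We see from (5.73) that the fitted values $\hat{Y}_i$ can be expressed as linear combinations of the response variable observations $Y_i$, with the coefficients being elements of the matrix $\mathbf{H}$. The $\mathbf{H}$ matrix involves only the observations on the predictor variable $X$, as is evident from (5.73a).

The square $n \times n$ matrix $\mathbf{H}$ is called the hat matrix. It plays an important role in diagnostics for regression analysis, as we shall see in Chapter 10 when we consider whether regression results are unduly influenced by one or a few observations. The matrix $\mathbf{H}$ is symmetric and has the special property (called idempotency):

$$\mathbf{H}\mathbf{H} = \mathbf{H} \tag{5.74}$$

In general, a matrix $\mathbf{M}$ is said to be idempotent if $\mathbf{M}\mathbf{M} = \mathbf{M}$.

Residuals

Let the vector of the residuals $e_i = Y_i - \hat{Y}_i$ be denoted by $\mathbf{e}$:

$$\underset{n \times 1}{\mathbf{e}} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{5.75}$$

In matrix notation, we then have:

$$\underset{n \times 1}{\mathbf{e}} = \underset{n \times 1}{\mathbf{Y}} - \underset{n \times 1}{\hat{\mathbf{Y}}} = \mathbf{Y} - \mathbf{X}\mathbf{b} \tag{5.76}$$

Example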


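For the Toluca Company example, we obtain the vector of the residuals by using the results in (5.61a) and (5.72):

$$\mathbf{e} = \begin{bmatrix} 399 \\ 121 \\ \vdots \\ 323 \end{bmatrix} - \begin{bmatrix} 347.98 \\ 169.47 \\ \vdots \\ 312.28 \end{bmatrix} = \begin{bmatrix} 51.02 \\ -48.47 \\ \vdots \\ 10.72 \end{bmatrix} \tag{5.77}$$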


The residuals are the same as in Table 1.2. 

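Variance-Covariance Matrix of Residuals. The residuals $e_i$, like the fitted values $\hat{Y}_i$, can be expressed as linear combinations of the response variable observations $Y_i$, using the result in (5.73) for $\hat{\mathbf{Y}}$:

$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{H}\mathbf{Y} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$$

We thus have the important result:

$$\underset{n \times 1}{\mathbf{e}} = \underset{n \times n}{(\mathbf{I} - \mathbf{H})} \, \underset{n \times 1}{\mathbf{Y}} \tag{5.78}$$

where $\mathbf{H}$ is the hat matrix defined in (5.73a). The matrix $\mathbf{I} - \mathbf{H}$, like the matrix $\mathbf{H}$, is symmetric and idempotent.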
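
The properties of the hat matrix and of the matrix of residual coefficients can be checked numerically. The following is a minimal sketch in Python with NumPy; the four data points are hypothetical and are not taken from the Toluca Company example.

```python
import numpy as np

# Hypothetical data for n = 4 cases; not taken from the text.
x = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 4.0, 8.0])

# X matrix: a column of 1s and a column of predictor observations.
X = np.column_stack([np.ones_like(x), x])
n = X.shape[0]

# Hat matrix (5.73a): H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H   # the matrix I - H

Y_hat = H @ Y       # fitted values, as in (5.73)
e = M @ Y           # residuals, as in (5.78)

# Least squares coefficients from the normal equations, for comparison.
b = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(H, H.T), np.allclose(H @ H, H))   # symmetric, idempotent
print(np.allclose(M, M.T), np.allclose(M @ M, M))   # same for I - H
print(np.allclose(e, Y - X @ b))                    # agrees with (5.76)
```

Note that H depends only on X, so the symmetry and idempotency checks would give the same result for any response vector.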

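The variance-covariance matrix of the vector of residuals $\mathbf{e}$ involves the matrix $\mathbf{I} - \mathbf{H}$:

$$\underset{n \times n}{\boldsymbol{\sigma}^2\{\mathbf{e}\}} = \sigma^2(\mathbf{I} - \mathbf{H}) \tag{5.79}$$

and is estimated by:

$$\underset{n \times n}{\mathbf{s}^2\{\mathbf{e}\}} = MSE(\mathbf{I} - \mathbf{H}) \tag{5.80}$$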


Comment 

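The variance-covariance matrix of $\mathbf{e}$ in (5.79) can be derived by means of (5.46). Since $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$, we obtain:

$$\boldsymbol{\sigma}^2\{\mathbf{e}\} = (\mathbf{I} - \mathbf{H})\boldsymbol{\sigma}^2\{\mathbf{Y}\}(\mathbf{I} - \mathbf{H})'$$

Now $\boldsymbol{\sigma}^2\{\mathbf{Y}\} = \boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\} = \sigma^2\mathbf{I}$ for the normal error model according to (5.56a). Also, $(\mathbf{I} - \mathbf{H})' = \mathbf{I} - \mathbf{H}$ because of the symmetry of the matrix. Hence:

$$\begin{aligned} \boldsymbol{\sigma}^2\{\mathbf{e}\} &= \sigma^2(\mathbf{I} - \mathbf{H})\mathbf{I}(\mathbf{I} - \mathbf{H}) \\ &= \sigma^2(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) \end{aligned}$$

In view of the fact that the matrix $\mathbf{I} - \mathbf{H}$ is idempotent, we know that $(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = \mathbf{I} - \mathbf{H}$ and we obtain formula (5.79). ■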


5.12 Analysis of Variance Results 

Sums of Squares 

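To see how the sums of squares are expressed in matrix notation, we begin with the total sum of squares SSTO, defined in (2.43). It will be convenient to use an algebraically equivalent expression:

$$SSTO = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n} \tag{5.81}$$

We know from (5.13) that:

$$\mathbf{Y}'\mathbf{Y} = \sum Y_i^2$$

The subtraction term $(\sum Y_i)^2 / n$ in matrix form uses $\mathbf{J}$, the matrix of 1s defined in (5.18), as follows:

$$\frac{(\sum Y_i)^2}{n} = \left(\frac{1}{n}\right)\mathbf{Y}'\mathbf{J}\mathbf{Y} \tag{5.82}$$

For instance, if $n = 2$, we have:

$$\left(\frac{1}{2}\right) \begin{bmatrix} Y_1 & Y_2 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \frac{(Y_1 + Y_2)(Y_1 + Y_2)}{2}$$

Hence, it follows that:

$$SSTO = \mathbf{Y}'\mathbf{Y} - \left(\frac{1}{n}\right)\mathbf{Y}'\mathbf{J}\mathbf{Y} \tag{5.83}$$

Just as $\sum Y_i^2$ is represented by $\mathbf{Y}'\mathbf{Y}$ in matrix terms, so $SSE = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$ can be represented as follows:

$$SSE = \mathbf{e}'\mathbf{e} = (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) \tag{5.84}$$

which can be shown to equal:

$$SSE = \mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y} \tag{5.84a}$$

Finally, it can be shown that:

$$SSR = \mathbf{b}'\mathbf{X}'\mathbf{Y} - \left(\frac{1}{n}\right)\mathbf{Y}'\mathbf{J}\mathbf{Y} \tag{5.85}$$

Example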

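Let us find SSE for the Toluca Company example by matrix methods, using (5.84a). Using (5.61a), we obtain:

$$\mathbf{Y}'\mathbf{Y} = \begin{bmatrix} 399 & 121 & \cdots & 323 \end{bmatrix} \begin{bmatrix} 399 \\ 121 \\ \vdots \\ 323 \end{bmatrix} = 2{,}745{,}173$$

and using (5.65) and (5.63), we find:

$$\mathbf{b}'\mathbf{X}'\mathbf{Y} = \begin{bmatrix} 62.37 & 3.5702 \end{bmatrix} \begin{bmatrix} 7{,}807 \\ 617{,}180 \end{bmatrix} = 2{,}690{,}348$$

Hence:

$$SSE = \mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y} = 2{,}745{,}173 - 2{,}690{,}348 = 54{,}825$$

which is the same result as that obtained in Chapter 1. Any difference would have been due to rounding effects.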
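
The SSE calculation can be repeated from the quantities reported in the text. This minimal sketch (Python with NumPy, an assumption of this illustration) uses only Y'Y, X'X, and X'Y as given above, and carries b at full precision as the text recommends.

```python
import numpy as np

# Quantities reported in the text for the Toluca Company example.
YtY = 2745173.0                          # Y'Y
XtX = np.array([[25.0, 1750.0],
                [1750.0, 142300.0]])     # X'X from (5.62)
XtY = np.array([7807.0, 617180.0])       # X'Y from (5.63)

# b = (X'X)^{-1} X'Y, kept at full precision.
b = np.linalg.solve(XtX, XtY)

# (5.84a): SSE = Y'Y - b'X'Y
bXtY = b @ XtY
SSE = YtY - bXtY

print(round(bXtY), round(SSE))
```

The unrounded value of SSE differs from 54,825 only in the decimal places, consistent with the remark about rounding effects.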


Comment 

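To illustrate the derivation of the sums of squares expressions in matrix notation, consider SSE:

$$SSE = \mathbf{e}'\mathbf{e} = (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) = \mathbf{Y}'\mathbf{Y} - 2\mathbf{b}'\mathbf{X}'\mathbf{Y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}$$

In substituting for the rightmost $\mathbf{b}$ we obtain by (5.60):

$$\begin{aligned} SSE &= \mathbf{Y}'\mathbf{Y} - 2\mathbf{b}'\mathbf{X}'\mathbf{Y} + \mathbf{b}'\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \\ &= \mathbf{Y}'\mathbf{Y} - 2\mathbf{b}'\mathbf{X}'\mathbf{Y} + \mathbf{b}'\mathbf{I}\mathbf{X}'\mathbf{Y} \end{aligned}$$

In dropping $\mathbf{I}$ and subtracting, we obtain the result in (5.84a). ■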


Sums of Squares as Quadratic Forms 

The ANOVA sums of squares can be shown to be quadratic forms. An example of a quadratic form of the observations $Y_i$ when $n = 2$ is:

$$5Y_1^2 + 6Y_1Y_2 + 4Y_2^2 \tag{5.86}$$

Note that this expression is a second-degree polynomial containing terms involving the squares of the observations and the cross product. We can express (5.86) in matrix terms as follows:

$$\begin{bmatrix} Y_1 & Y_2 \end{bmatrix} \begin{bmatrix} 5 & 3 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \mathbf{Y}'\mathbf{A}\mathbf{Y} \tag{5.86a}$$

where $\mathbf{A}$ is a symmetric matrix.
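
The equivalence of (5.86) and (5.86a) is easy to verify for any particular observations. The following minimal sketch in Python with NumPy does so; the helper name `quadratic_form` and the observation values are hypothetical.

```python
import numpy as np

# Matrix of the quadratic form in (5.86a).
A = np.array([[5.0, 3.0],
              [3.0, 4.0]])

def quadratic_form(y, A):
    """Return the scalar Y'AY for a vector y and a symmetric matrix A."""
    y = np.asarray(y, dtype=float)
    return float(y @ A @ y)

# Hypothetical observations Y1 = 2, Y2 = -1.
y1, y2 = 2.0, -1.0
matrix_value = quadratic_form([y1, y2], A)

# Second-degree polynomial in (5.86).
polynomial_value = 5 * y1**2 + 6 * y1 * y2 + 4 * y2**2

print(matrix_value, polynomial_value)
```

The cross-product coefficient 6 is split equally between the two off-diagonal elements of A, which is what makes the matrix symmetric.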





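In general, a quadratic form is defined as:

$$\underset{1 \times 1}{\mathbf{Y}'\mathbf{A}\mathbf{Y}} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} Y_i Y_j \qquad \text{where } a_{ij} = a_{ji} \tag{5.87}$$

$\mathbf{A}$ is a symmetric $n \times n$ matrix and is called the matrix of the quadratic form.

The ANOVA sums of squares SSTO, SSE, and SSR are all quadratic forms, as can be seen by reexpressing $\mathbf{b}'\mathbf{X}'$. From (5.71), we know, using (5.32), that:

$$\mathbf{b}'\mathbf{X}' = (\mathbf{X}\mathbf{b})' = \hat{\mathbf{Y}}'$$

We now use the result in (5.73) to obtain:

$$\mathbf{b}'\mathbf{X}' = (\mathbf{H}\mathbf{Y})'$$

Since $\mathbf{H}$ is a symmetric matrix so that $\mathbf{H}' = \mathbf{H}$, we finally obtain, using (5.32):

$$\mathbf{b}'\mathbf{X}' = \mathbf{Y}'\mathbf{H} \tag{5.88}$$

This result enables us to express the ANOVA sums of squares as follows:

$$SSTO = \mathbf{Y}'\left[\mathbf{I} - \left(\frac{1}{n}\right)\mathbf{J}\right]\mathbf{Y} \tag{5.89a}$$

$$SSE = \mathbf{Y}'(\mathbf{I} - \mathbf{H})\mathbf{Y} \tag{5.89b}$$

$$SSR = \mathbf{Y}'\left[\mathbf{H} - \left(\frac{1}{n}\right)\mathbf{J}\right]\mathbf{Y} \tag{5.89c}$$

Each of these sums of squares can now be seen to be of the form $\mathbf{Y}'\mathbf{A}\mathbf{Y}$, where the three $\mathbf{A}$ matrices are:

$$\mathbf{I} - \left(\frac{1}{n}\right)\mathbf{J} \tag{5.90a}$$

$$\mathbf{I} - \mathbf{H} \tag{5.90b}$$

$$\mathbf{H} - \left(\frac{1}{n}\right)\mathbf{J} \tag{5.90c}$$

Since each of these $\mathbf{A}$ matrices is symmetric, SSTO, SSE, and SSR are quadratic forms, with the matrices of the quadratic forms given in (5.90). Quadratic forms play an important role in statistics because all sums of squares in the analysis of variance for linear statistical models can be expressed as quadratic forms.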
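
The three quadratic forms in (5.89a) to (5.89c) can be evaluated directly. The sketch below uses Python with NumPy on a small hypothetical data set (not from the text) and confirms that the forms reproduce the usual sums of squares and that SSTO = SSE + SSR.

```python
import numpy as np

# Hypothetical data for n = 4 cases; not taken from the text.
x = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 4.0, 8.0])
n = Y.shape[0]

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix (5.73a)
I = np.eye(n)
J = np.ones((n, n))                    # matrix of 1s

# Quadratic forms (5.89a)-(5.89c).
SSTO = Y @ (I - J / n) @ Y
SSE = Y @ (I - H) @ Y
SSR = Y @ (H - J / n) @ Y

# Direct definitions, for comparison.
SSTO_direct = np.sum((Y - Y.mean()) ** 2)
SSE_direct = np.sum((Y - H @ Y) ** 2)

print(SSTO, SSE, SSR)
```

The decomposition holds because the three matrices in (5.90) satisfy (I - J/n) = (I - H) + (H - J/n).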


5.13 Inferences in Regression Analysis

As we saw in earlier chapters, all interval estimates are of the following form: point estimator plus and minus a certain number of estimated standard deviations of the point estimator. Similarly, all tests require the point estimator and the estimated standard deviation of the point estimator or, in the case of analysis of variance tests, various sums of squares. Matrix algebra is of principal help in inference making when obtaining the estimated standard deviations and sums of squares. We have already given the matrix equivalents of the sums of squares for the analysis of variance. We focus here chiefly on the matrix expressions for the estimated variances of point estimators of interest.





Regression Coefficients 

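The variance-covariance matrix of $\mathbf{b}$:

$$\underset{2 \times 2}{\boldsymbol{\sigma}^2\{\mathbf{b}\}} = \begin{bmatrix} \sigma^2\{b_0\} & \sigma\{b_0, b_1\} \\ \sigma\{b_1, b_0\} & \sigma^2\{b_1\} \end{bmatrix} \tag{5.91}$$

is:

$$\underset{2 \times 2}{\boldsymbol{\sigma}^2\{\mathbf{b}\}} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \tag{5.92}$$

or, from (5.24a):

$$\underset{2 \times 2}{\boldsymbol{\sigma}^2\{\mathbf{b}\}} = \begin{bmatrix} \dfrac{\sigma^2}{n} + \dfrac{\sigma^2 \bar{X}^2}{\sum (X_i - \bar{X})^2} & \dfrac{-\bar{X}\sigma^2}{\sum (X_i - \bar{X})^2} \\[3ex] \dfrac{-\bar{X}\sigma^2}{\sum (X_i - \bar{X})^2} & \dfrac{\sigma^2}{\sum (X_i - \bar{X})^2} \end{bmatrix} \tag{5.92a}$$

When MSE is substituted for $\sigma^2$ in (5.92a), we obtain the estimated variance-covariance matrix of $\mathbf{b}$, denoted by $\mathbf{s}^2\{\mathbf{b}\}$:

$$\underset{2 \times 2}{\mathbf{s}^2\{\mathbf{b}\}} = MSE(\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix} \dfrac{MSE}{n} + \dfrac{\bar{X}^2 MSE}{\sum (X_i - \bar{X})^2} & \dfrac{-\bar{X} MSE}{\sum (X_i - \bar{X})^2} \\[3ex] \dfrac{-\bar{X} MSE}{\sum (X_i - \bar{X})^2} & \dfrac{MSE}{\sum (X_i - \bar{X})^2} \end{bmatrix} \tag{5.93}$$

In (5.92a), you will recognize the variances of $b_0$ in (2.22b) and of $b_1$ in (2.3b) and the covariance of $b_0$ and $b_1$ in (4.5). Likewise, the estimated variances in (5.93) are familiar from earlier chapters.

Example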

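We wish to find $s^2\{b_0\}$ and $s^2\{b_1\}$ for the Toluca Company example by matrix methods. Using the results in Figure 2.2 and in (5.64), we obtain:

$$\mathbf{s}^2\{\mathbf{b}\} = MSE(\mathbf{X}'\mathbf{X})^{-1} = 2{,}384 \begin{bmatrix} .287475 & -.003535 \\ -.003535 & .00005051 \end{bmatrix} = \begin{bmatrix} 685.34 & -8.428 \\ -8.428 & .12040 \end{bmatrix} \tag{5.94}$$

Thus, $s^2\{b_0\} = 685.34$ and $s^2\{b_1\} = .12040$. These are the same as the results obtained in Chapter 2.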
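
The estimated variance-covariance matrix in (5.94) follows directly from (5.93). Here is a minimal sketch in Python with NumPy (an assumption of this illustration), using the X'X matrix from (5.62) and the MSE value of 2,384 quoted above.

```python
import numpy as np

# Inputs reported in the text for the Toluca Company example.
MSE = 2384.0
XtX = np.array([[25.0, 1750.0],
                [1750.0, 142300.0]])   # X'X from (5.62)

# (5.93): s^2{b} = MSE (X'X)^{-1}
s2_b = MSE * np.linalg.inv(XtX)

s2_b0 = s2_b[0, 0]      # estimated variance of b0
s2_b1 = s2_b[1, 1]      # estimated variance of b1
s_b0_b1 = s2_b[0, 1]    # estimated covariance of b0 and b1

print(np.round(s2_b, 5))
```

The diagonal elements are the estimated variances and the off-diagonal element is the estimated covariance, matching the layout of (5.91).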


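Comment

To derive the variance-covariance matrix of $\mathbf{b}$, recall that:

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{A}\mathbf{Y}$$

where $\mathbf{A}$ is a constant matrix:

$$\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$$

Hence, by (5.46) we have:

$$\boldsymbol{\sigma}^2\{\mathbf{b}\} = \mathbf{A}\boldsymbol{\sigma}^2\{\mathbf{Y}\}\mathbf{A}'$$

Now $\boldsymbol{\sigma}^2\{\mathbf{Y}\} = \sigma^2\mathbf{I}$. Further, it follows from (5.32) and the fact that $(\mathbf{X}'\mathbf{X})^{-1}$ is symmetric that:

$$\mathbf{A}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$$

We find therefore:

$$\begin{aligned} \boldsymbol{\sigma}^2\{\mathbf{b}\} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \end{aligned}$$

■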



Mean Response 

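To estimate the mean response at $X_h$, let us define the vector:

$$\underset{2 \times 1}{\mathbf{X}_h} = \begin{bmatrix} 1 \\ X_h \end{bmatrix} \qquad \text{or} \qquad \underset{1 \times 2}{\mathbf{X}_h'} = \begin{bmatrix} 1 & X_h \end{bmatrix} \tag{5.95}$$

The fitted value in matrix notation then is:

$$\hat{Y}_h = \mathbf{X}_h'\mathbf{b} \tag{5.96}$$

since:

$$\mathbf{X}_h'\mathbf{b} = \begin{bmatrix} 1 & X_h \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = [b_0 + b_1 X_h] = [\hat{Y}_h] = \hat{Y}_h$$

Note that $\mathbf{X}_h'\mathbf{b}$ is a $1 \times 1$ matrix; hence, we can write the final result as a scalar.

The variance of $\hat{Y}_h$, given earlier in (2.29b), in matrix notation is:

$$\sigma^2\{\hat{Y}_h\} = \sigma^2 \mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h \tag{5.97}$$

The variance of $\hat{Y}_h$ in (5.97) can be expressed as a function of $\boldsymbol{\sigma}^2\{\mathbf{b}\}$, the variance-covariance matrix of the estimated regression coefficients, by making use of the result in (5.92):

$$\sigma^2\{\hat{Y}_h\} = \mathbf{X}_h'\boldsymbol{\sigma}^2\{\mathbf{b}\}\mathbf{X}_h \tag{5.97a}$$

The estimated variance of $\hat{Y}_h$, given earlier in (2.30), in matrix notation is:

$$s^2\{\hat{Y}_h\} = MSE(\mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h) \tag{5.98}$$

Example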

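We wish to find $s^2\{\hat{Y}_h\}$ for the Toluca Company example when $X_h = 65$. We define:

$$\mathbf{X}_h' = \begin{bmatrix} 1 & 65 \end{bmatrix}$$

and use the result in (5.94) to obtain:

$$s^2\{\hat{Y}_h\} = \mathbf{X}_h'\mathbf{s}^2\{\mathbf{b}\}\mathbf{X}_h = \begin{bmatrix} 1 & 65 \end{bmatrix} \begin{bmatrix} 685.34 & -8.428 \\ -8.428 & .12040 \end{bmatrix} \begin{bmatrix} 1 \\ 65 \end{bmatrix} = 98.37$$

This is the same result as that obtained in Chapter 2.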
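
The value 98.37 depends on carrying more digits than are displayed in (5.94). The following minimal sketch (Python with NumPy, an assumption of this illustration) recomputes it from X'X in (5.62) and the quoted MSE of 2,384.

```python
import numpy as np

# Inputs reported in the text for the Toluca Company example.
MSE = 2384.0
XtX = np.array([[25.0, 1750.0],
                [1750.0, 142300.0]])   # X'X from (5.62)

# X_h as defined in (5.95), for X_h = 65.
X_h = np.array([1.0, 65.0])

# (5.98): s^2{Y_hat_h} = MSE * X_h' (X'X)^{-1} X_h
s2_Yhat_h = MSE * (X_h @ np.linalg.solve(XtX, X_h))

# Equivalent route through s^2{b}: X_h' s^2{b} X_h
s2_b = MSE * np.linalg.inv(XtX)
s2_Yhat_h_via_b = X_h @ s2_b @ X_h

print(round(s2_Yhat_h, 2))
```

Multiplying out the rounded entries shown in (5.94) by hand gives a value slightly different from 98.37, which illustrates why the text carries extra digits in intermediate matrix results.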



State the dimension of each resulting matrix.

5.3. Show how the following expressions are written in terms of matrices: (1) $Y_i - \hat{Y}_i = e_i$, (2) $\sum X_i e_i = 0$. Assume $i = 1, \ldots, 4$.

State the dimension of each resulting matrix.

5.2. For the matrices below, obtain (1) $\mathbf{A} + \mathbf{C}$, (2) $\mathbf{A} - \mathbf{C}$, (3) $\mathbf{B}'\mathbf{A}$, (4) $\mathbf{A}\mathbf{C}'$, (5) $\mathbf{C}'\mathbf{A}$.


Comment 

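The result in (5.97a) can be derived directly by using (5.46), since $\hat{Y}_h = \mathbf{X}_h'\mathbf{b}$:

$$\sigma^2\{\hat{Y}_h\} = \mathbf{X}_h'\boldsymbol{\sigma}^2\{\mathbf{b}\}\mathbf{X}_h$$

Hence:

$$\sigma^2\{\hat{Y}_h\} = \begin{bmatrix} 1 & X_h \end{bmatrix} \begin{bmatrix} \sigma^2\{b_0\} & \sigma\{b_0, b_1\} \\ \sigma\{b_1, b_0\} & \sigma^2\{b_1\} \end{bmatrix} \begin{bmatrix} 1 \\ X_h \end{bmatrix}$$

or:

$$\sigma^2\{\hat{Y}_h\} = \sigma^2\{b_0\} + 2X_h\sigma\{b_0, b_1\} + X_h^2\sigma^2\{b_1\} \tag{5.99}$$

Using the results from (5.92a), we obtain:

$$\sigma^2\{\hat{Y}_h\} = \frac{\sigma^2}{n} + \frac{\sigma^2\bar{X}^2}{\sum (X_i - \bar{X})^2} + \frac{2X_h(-\bar{X})\sigma^2}{\sum (X_i - \bar{X})^2} + \frac{X_h^2\sigma^2}{\sum (X_i - \bar{X})^2}$$

which reduces to the familiar expression:

$$\sigma^2\{\hat{Y}_h\} = \sigma^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right] \tag{5.99a}$$

Thus, we see explicitly that the variance expression in (5.99a) contains contributions from $\sigma^2\{b_0\}$, $\sigma^2\{b_1\}$, and $\sigma\{b_0, b_1\}$, which it must according to (A.30b) since $\hat{Y}_h = b_0 + b_1 X_h$ is a linear combination of $b_0$ and $b_1$. ■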

Prediction of New Observation 

The estimated variance $s^2\{\text{pred}\}$, given earlier in (2.38), in matrix notation is:

$$s^2\{\text{pred}\} = MSE(1 + \mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h) \tag{5.100}$$


Cited Reference

5.1. Graybill, F. A. Matrices with Applications in Statistics. 2nd ed. Belmont, Calif.: Wadsworth, 2002.

Problems

5.1. For the matrices below, obtain (1) $\mathbf{A} + \mathbf{B}$, (2) $\mathbf{A} - \mathbf{B}$, (3) $\mathbf{A}\mathbf{C}$, (4) $\mathbf{A}\mathbf{B}'$, (5) $\mathbf{B}'\mathbf{A}$.









*5.4. Flavor deterioration. The results shown below were obtained in a small-scale experiment to study the relation between °F of storage temperature ($X$) and number of weeks before flavor deterioration of a food product begins to occur ($Y$).

i:      1      2      3      4      5
X_i:    8      4      0     -4     -8
Y_i:  7.8    9.0   10.2   11.0   11.7

Assume that first-order regression model (2.1) is applicable. Using matrix methods, find (1) $\mathbf{Y}'\mathbf{Y}$, (2) $\mathbf{X}'\mathbf{X}$, (3) $\mathbf{X}'\mathbf{Y}$.


5.5. Consumer finance. The data below show, for a consumer finance company operating in six cities, the number of competing loan companies operating in the city ($X$) and the number per thousand of the company's loans made in that city that are currently delinquent ($Y$):

i:     1    2    3    4    5    6
X_i:   4    1    2    3    3    4
Y_i:  16    5   10   15   13   22

Assume that first-order regression model (2.1) is applicable. Using matrix methods, find (1) $\mathbf{Y}'\mathbf{Y}$, (2) $\mathbf{X}'\mathbf{X}$, (3) $\mathbf{X}'\mathbf{Y}$.

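*5.6. Refer to Airfreight breakage Problem 1.21. Using matrix methods, find (1) $\mathbf{Y}'\mathbf{Y}$, (2) $\mathbf{X}'\mathbf{X}$, (3) $\mathbf{X}'\mathbf{Y}$.

5.7. Refer to Plastic hardness Problem 1.22. Using matrix methods, find (1) $\mathbf{Y}'\mathbf{Y}$, (2) $\mathbf{X}'\mathbf{X}$, (3) $\mathbf{X}'\mathbf{Y}$.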

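5.8. Let $\mathbf{B}$ be defined as follows:

$$\mathbf{B} = \begin{bmatrix} 1 & 5 & 0 \\ 1 & 0 & 5 \\ 1 & 0 & 5 \end{bmatrix}$$

a. Are the column vectors of $\mathbf{B}$ linearly dependent?

b. What is the rank of $\mathbf{B}$?

c. What must be the determinant of $\mathbf{B}$?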

5.9. Let $\mathbf{A}$ be defined as follows:

$$\mathbf{A} = \begin{bmatrix} 0 & 1 & 8 \\ 0 & 3 & 1 \\ 0 & 5 & 5 \end{bmatrix}$$

a. Are the column vectors of $\mathbf{A}$ linearly dependent?

b. Restate definition (5.20) in terms of row vectors. Are the row vectors of $\mathbf{A}$ linearly dependent?

c. What is the rank of $\mathbf{A}$?

d. Calculate the determinant of $\mathbf{A}$.

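5.10. Find the inverse of each of the following matrices:

$$\mathbf{A} = \begin{bmatrix} 2 & 4 \\ 3 & 1 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} 4 & 3 & 2 \\ 6 & 5 & 10 \\ 10 & 1 & 6 \end{bmatrix}$$

Check in each case that the resulting matrix is indeed the inverse.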





5.11. Find the inverse of the following matrix:

$$\mathbf{A} = \begin{bmatrix} 5 & 1 & 3 \\ 4 & 0 & 5 \\ 1 & 9 & 6 \end{bmatrix}$$

Check that the resulting matrix is indeed the inverse.

*5.12. Refer to Flavor deterioration Problem 5.4. Find $(\mathbf{X}'\mathbf{X})^{-1}$.

5.13. Refer to Consumer finance Problem 5.5. Find $(\mathbf{X}'\mathbf{X})^{-1}$.

*5.14. Consider the simultaneous equations:

$$\begin{aligned} 4y_1 + 7y_2 &= 25 \\ 2y_1 + 3y_2 &= 12 \end{aligned}$$

a. Write these equations in matrix notation.

b. Using matrix methods, find the solutions for $y_1$ and $y_2$.

5.15. Consider the simultaneous equations:

$$\begin{aligned} 5y_1 + 2y_2 &= 8 \\ 23y_1 + 7y_2 &= 28 \end{aligned}$$

a. Write these equations in matrix notation.

b. Using matrix methods, find the solutions for $y_1$ and $y_2$.

5.16. Consider the estimated linear regression function in the form of (1.15). Write expressions in this form for the fitted values $\hat{Y}_i$ in matrix terms for $i = 1, \ldots, 5$.

5.17. Consider the following functions of the random variables $Y_1$, $Y_2$, and $Y_3$:

$$\begin{aligned} W_1 &= Y_1 + Y_2 + Y_3 \\ W_2 &= Y_1 - Y_2 \\ W_3 &= Y_1 - Y_2 - Y_3 \end{aligned}$$

a. State the above in matrix notation.

b. Find the expectation of the random vector $\mathbf{W}$.

c. Find the variance-covariance matrix of $\mathbf{W}$.

*5.18. Consider the following functions of the random variables $Y_1$, $Y_2$, $Y_3$, and $Y_4$:

$$\begin{aligned} W_1 &= \frac{1}{4}(Y_1 + Y_2 + Y_3 + Y_4) \\ W_2 &= \frac{1}{2}(Y_1 + Y_2) - \frac{1}{2}(Y_3 + Y_4) \end{aligned}$$

a. State the above in matrix notation.

b. Find the expectation of the random vector $\mathbf{W}$.

c. Find the variance-covariance matrix of $\mathbf{W}$.

*5.19. Find the matrix $\mathbf{A}$ of the quadratic form:

$$3Y_1^2 + 10Y_1Y_2 + 17Y_2^2$$

5.20. Find the matrix $\mathbf{A}$ of the quadratic form:

$$7Y_1^2 - 8Y_1Y_2 + 8Y_2^2$$





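*5.21. For the matrix:

$$\mathbf{A} = \begin{bmatrix} 5 & 2 \\ 2 & 1 \end{bmatrix}$$

find the quadratic form of the observations $Y_1$ and $Y_2$.

5.22. For the matrix:

$$\mathbf{A} = \begin{bmatrix} 1 & 0 & 4 \\ 0 & 3 & 0 \\ 4 & 0 & 9 \end{bmatrix}$$

find the quadratic form of the observations $Y_1$, $Y_2$, and $Y_3$.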

*5.23. Refer to Flavor deterioration Problems 5.4 and 5.12.

a. Using matrix methods, obtain the following: (1) vector of estimated regression coefficients, (2) vector of residuals, (3) SSR, (4) SSE, (5) estimated variance-covariance matrix of $\mathbf{b}$, (6) point estimate of $E\{Y_h\}$ when $X_h = -6$, (7) estimated variance of $\hat{Y}_h$ when $X_h = -6$.

b. What simplifications arose from the spacing of the $X$ levels in the experiment?

c. Find the hat matrix $\mathbf{H}$.

d. Find $\mathbf{s}^2\{\mathbf{e}\}$.

5.24. Refer to Consumer finance Problems 5.5 and 5.13.

a. Using matrix methods, obtain the following: (1) vector of estimated regression coefficients, (2) vector of residuals, (3) SSR, (4) SSE, (5) estimated variance-covariance matrix of $\mathbf{b}$, (6) point estimate of $E\{Y_h\}$ when $X_h = 4$, (7) $s^2\{\text{pred}\}$ when $X_h = 4$.

b. From your estimated variance-covariance matrix in part (a5), obtain the following: (1) $s\{b_0, b_1\}$; (2) $s^2\{b_0\}$; (3) $s\{b_1\}$.

c. Find the hat matrix $\mathbf{H}$.

d. Find $\mathbf{s}^2\{\mathbf{e}\}$.

*5.25. Refer to Airfreight breakage Problems 1.21 and 5.6.

a. Using matrix methods, obtain the following: (1) $(\mathbf{X}'\mathbf{X})^{-1}$, (2) $\mathbf{b}$, (3) $\mathbf{e}$, (4) $\mathbf{H}$, (5) SSE, (6) $\mathbf{s}^2\{\mathbf{b}\}$, (7) $\hat{Y}_h$ when $X_h = 2$, (8) $s^2\{\hat{Y}_h\}$ when $X_h = 2$.

b. From part (a6), obtain the following: (1) $s^2\{b_1\}$; (2) $s\{b_0, b_1\}$; (3) $s\{b_0\}$.

c. Find the matrix of the quadratic form for SSR. 

5.26. Refer to Plastic hardness Problems 1.22 and 5.7.

a. Using matrix methods, obtain the following: (1) $(\mathbf{X}'\mathbf{X})^{-1}$, (2) $\mathbf{b}$, (3) $\hat{\mathbf{Y}}$, (4) $\mathbf{H}$, (5) SSE, (6) $\mathbf{s}^2\{\mathbf{b}\}$, (7) $s^2\{\text{pred}\}$ when $X_h = 30$.

b. From part (a6), obtain the following: (1) $s^2\{b_0\}$; (2) $s\{b_0, b_1\}$; (3) $s\{b_1\}$.

c. Obtain the matrix of the quadratic form for SSE.


Exercises 


5.27. Refer to regression-through-the-origin model (4.10). Set up the expectation vector for $\boldsymbol{\varepsilon}$. Assume that $i = 1, \ldots, 4$.

5.28. Consider model (4.10) for regression through the origin and the estimator $b_1$ given in (4.14). Obtain (4.14) by utilizing (5.60) with $\mathbf{X}$ suitably defined.

5.29. Consider the least squares estimator $\mathbf{b}$ given in (5.60). Using matrix methods, show that $\mathbf{b}$ is an unbiased estimator.

5.30. Show that $\hat{Y}_h$ in (5.96) can be expressed in matrix terms as $\mathbf{b}'\mathbf{X}_h$.

5.31. Obtain an expression for the variance-covariance matrix of the fitted values $\hat{Y}_i$, $i = 1, \ldots, n$, in terms of the hat matrix.




Part II

Multiple Linear Regression

Chapter 6

Multiple Regression I


Multiple regression analysis is one of the most widely used of all statistical methods. In 
this chapter, we first discuss a variety of multiple regression models. Then we present the 
basic statistical results for multiple regression in matrix form. Since the matrix expressions 
for multiple regression are the same as for simple linear regression, we state the results 
without much discussion. We conclude the chapter with an example, illustrating a variety 
of inferences and residual analyses in multiple regression analysis. 

6.1 Multiple Regression Models

Need for Several Predictor Variables 

When we first introduced regression analysis in Chapter 1, we spoke of regression models 
containing a number of predictor variables. We mentioned a regression model where the 
response variable was direct operating cost for a branch office of a consumer finance chain, 
and four predictor variables were considered, including average number of loans outstanding 
at the branch and total number of new loan applications processed by the branch. We also 
mentioned a tractor purchase study where the response variable was volume of tractor 
purchases in a sales territory, and the nine predictor variables included number of farms in 
the territory and quantity of crop production in the territory. In addition, we mentioned a 
study of short children where the response variable was the peak plasma growth hormone 
level, and the 14 predictor variables included gender, age, and various body measurements. 
In all these examples, a single predictor variable in the model would have provided an
inadequate description since a number of key variables affect the response variable in
important and distinctive ways. Furthermore, in situations of this type, we frequently find
that predictions of the response variable based on a model containing only a single predictor 
variable are too imprecise to be useful. We noted the imprecise predictions with a single 
predictor variable in the Toluca Company example in Chapter 2. A more complex model, 
containing additional predictor variables, typically is more helpful in providing sufficiently 
precise predictions of the response variable. 

In each of the examples just mentioned, the analysis was based on observational data because the predictor variables were not controlled, usually because they were not susceptible to direct control. Multiple regression analysis is also highly useful in experimental situations where the experimenter can control the predictor variables. An experimenter typically will wish to investigate a number of predictor variables simultaneously because almost always more than one key predictor variable influences the response. For example, in a study of productivity of work crews, the experimenter may wish to control both the size of the crew and the level of bonus pay. Similarly, in a study of responsiveness to a drug, the experimenter may wish to control both the dose of the drug and the method of administration.

The multiple regression models which we now describe can be utilized for either observational data or for experimental data from a completely randomized design.

First-Order Model with Two Predictor Variables 

When there are two predictor variables $X_1$ and $X_2$, the regression model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{6.1}$$

is called a first-order model with two predictor variables. A first-order model, as we noted in Chapter 1, is linear in the predictor variables. $Y_i$ denotes as usual the response in the $i$th trial, and $X_{i1}$ and $X_{i2}$ are the values of the two predictor variables in the $i$th trial. The parameters of the model are $\beta_0$, $\beta_1$, and $\beta_2$, and the error term is $\varepsilon_i$.

Assuming that $E\{\varepsilon_i\} = 0$, the regression function for model (6.1) is:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{6.2}$$

Analogous to simple linear regression, where the regression function $E\{Y\} = \beta_0 + \beta_1 X$ is a line, regression function (6.2) is a plane. Figure 6.1 contains a representation of a portion of the response plane:

$$E\{Y\} = 10 + 2X_1 + 5X_2 \tag{6.3}$$

Note that any point on the response plane (6.3) corresponds to the mean response $E\{Y\}$ at the given combination of levels of $X_1$ and $X_2$.

Figure 6.1 also shows an observation $Y_i$ corresponding to the levels $(X_{i1}, X_{i2})$ of the two predictor variables. Note that the vertical rule in Figure 6.1 between $Y_i$ and the response plane represents the difference between $Y_i$ and the mean $E\{Y_i\}$ of the probability distribution of $Y$ for the given $(X_{i1}, X_{i2})$ combination. Hence, the vertical distance from $Y_i$ to the response plane represents the error term $\varepsilon_i = Y_i - E\{Y_i\}$.

FIGURE 6.1 Response Function Is a Plane: Sales Promotion Example.


Frequently the regression function in multiple regression is called a regression surface 
or a response surface. In Figure 6.1, the response surface is a plane, but in other cases the 
response surface may be more complex in nature. 


Meaning of Regression Coefficients. Let us now consider the meaning of the regression
coefficients in the multiple regression function (6.3). The parameter $\beta_0 = 10$ is the $Y$
intercept of the regression plane. If the scope of the model includes $X_1 = 0$, $X_2 = 0$, then
$\beta_0 = 10$ represents the mean response $E\{Y\}$ at $X_1 = 0$, $X_2 = 0$. Otherwise, $\beta_0$ does not
have any particular meaning as a separate term in the regression model.

The parameter $\beta_1$ indicates the change in the mean response $E\{Y\}$ per unit increase in
$X_1$ when $X_2$ is held constant. Likewise, $\beta_2$ indicates the change in the mean response per
unit increase in $X_2$ when $X_1$ is held constant. To see this for our example, suppose $X_2$ is
held at the level $X_2 = 2$. The regression function (6.3) now is:

$$E\{Y\} = 10 + 2X_1 + 5(2) = 20 + 2X_1 \qquad X_2 = 2 \tag{6.4}$$

Note that this response function is a straight line with slope $\beta_1 = 2$. The same is true for
any other value of $X_2$; only the intercept of the response function will differ. Hence, $\beta_1 = 2$
indicates that the mean response $E\{Y\}$ increases by 2 with a unit increase in $X_1$ when $X_2$ is
constant, no matter what the level of $X_2$. We confirm therefore that $\beta_1$ indicates the change
in $E\{Y\}$ with a unit increase in $X_1$ when $X_2$ is held constant.

Similarly, $\beta_2 = 5$ in regression function (6.3) indicates that the mean response $E\{Y\}$
increases by 5 with a unit increase in $X_2$ when $X_1$ is held constant.

When the effect of $X_1$ on the mean response does not depend on the level of $X_2$, and
correspondingly the effect of $X_2$ does not depend on the level of $X_1$, the two predictor
variables are said to have additive effects or not to interact. Thus, the first-order regression
model (6.1) is designed for predictor variables whose effects on the mean response are
additive or do not interact.

The parameters $\beta_1$ and $\beta_2$ are sometimes called partial regression coefficients because
they reflect the partial effect of one predictor variable when the other predictor variable is
included in the model and is held constant.


Example 


The response plane (6.3) shown in Figure 6.1 is for a regression model relating test market
sales ($Y$, in 10 thousand dollars) to point-of-sale expenditures ($X_1$, in thousand dollars) and
TV expenditures ($X_2$, in thousand dollars). Since $\beta_1 = 2$, if point-of-sale expenditures in
a locality are increased by one unit (1 thousand dollars) while TV expenditures are held
constant, expected sales increase by 2 units (20 thousand dollars). Similarly, since $\beta_2 = 5$,
if TV expenditures in a locality are increased by 1 thousand dollars and point-of-sale
expenditures are held constant, expected sales increase by 50 thousand dollars.
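
The additivity of the two effects in response plane (6.3) can be checked numerically. The sketch below is only an illustration: the function name `mean_response` and the particular levels tried are assumptions, not part of the example itself.

```python
def mean_response(x1, x2, b0=10.0, b1=2.0, b2=5.0):
    """Mean response E{Y} on the plane (6.3): 10 + 2*X1 + 5*X2."""
    return b0 + b1 * x1 + b2 * x2


# Change in E{Y} for a unit increase in X1 (from 3 to 4) at several fixed levels of X2.
effects_x1 = [mean_response(4, x2) - mean_response(3, x2) for x2 in (0, 2, 7)]

# Change in E{Y} for a unit increase in X2 (from 2 to 3) at several fixed levels of X1.
effects_x2 = [mean_response(x1, 3) - mean_response(x1, 2) for x1 in (0, 1, 5)]

print(effects_x1)
print(effects_x2)
```

Every entry of the first list equals the coefficient of $X_1$ and every entry of the second equals the coefficient of $X_2$, whatever level the other variable is held at.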


Comments 

1. A regression model for which the response surface is a plane can be used either in its own right
when it is appropriate, or as an approximation to a more complex response surface. Many complex
response surfaces can be approximated well by a plane for limited ranges of $X_1$ and $X_2$.





2. We can readily establish the meaning of $\beta_1$ and $\beta_2$ by calculus, taking partial derivatives of the
response surface (6.2) with respect to $X_1$ and $X_2$ in turn:

$$\frac{\partial E\{Y\}}{\partial X_1} = \beta_1 \qquad \frac{\partial E\{Y\}}{\partial X_2} = \beta_2$$

The partial derivatives measure the rate of change in $E\{Y\}$ with respect to one predictor variable when
the other is held constant.

First-Order Model with More than Two Predictor Variables 

We consider now the case where there are $p - 1$ predictor variables $X_1, \ldots, X_{p-1}$. The
regression model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i \tag{6.5}$$

is called a first-order model with $p - 1$ predictor variables. It can also be written:

$$Y_i = \beta_0 + \sum_{k=1}^{p-1} \beta_k X_{ik} + \varepsilon_i \tag{6.5a}$$

or, if we let $X_{i0} \equiv 1$, it can be written as:

$$Y_i = \sum_{k=0}^{p-1} \beta_k X_{ik} + \varepsilon_i \qquad \text{where } X_{i0} \equiv 1 \tag{6.5b}$$

Assuming that $E\{\varepsilon_i\} = 0$, the response function for regression model (6.5) is:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} \tag{6.6}$$

This response function is a hyperplane, which is a plane in more than two dimensions. It
is no longer possible to picture this response surface, as we were able to do in Figure 6.1
for the case of two predictor variables. Nevertheless, the meaning of the parameters is
analogous to the case of two predictor variables. The parameter $\beta_k$ indicates the change in
the mean response $E\{Y\}$ with a unit increase in the predictor variable $X_k$ when all other
predictor variables in the regression model are held constant. Note again that the effect
of any predictor variable on the mean response is the same for regression model (6.5) no
matter at what levels the other predictor variables are held. Hence, first-order
regression model (6.5) is designed for predictor variables whose effects on the mean
response are additive and therefore do not interact.

Comment 

When $p - 1 = 1$, regression model (6.5) reduces to:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i$$

which is the simple linear regression model considered in earlier chapters.

General Linear Regression Model 

In general, the variables $X_1, \ldots, X_{p-1}$ in a regression model do not need to represent
different predictor variables, as we shall shortly see. We therefore define the general linear
regression model, with normal error terms, simply in terms of $X$ variables:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i \tag{6.7}$$

where:

$\beta_0, \beta_1, \ldots, \beta_{p-1}$ are parameters
$X_{i1}, \ldots, X_{i,p-1}$ are known constants
$\varepsilon_i$ are independent $N(0, \sigma^2)$
$i = 1, \ldots, n$

If we let $X_{i0} \equiv 1$, regression model (6.7) can be written as follows:

$$Y_i = \beta_0 X_{i0} + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i \tag{6.7a}$$

where $X_{i0} \equiv 1$, or:

$$Y_i = \sum_{k=0}^{p-1} \beta_k X_{ik} + \varepsilon_i \qquad \text{where } X_{i0} \equiv 1 \tag{6.7b}$$

The response function for regression model (6.7) is, since $E\{\varepsilon_i\} = 0$:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} \tag{6.8}$$

Thus, the general linear regression model with normal error terms implies that the
observations $Y_i$ are independent normal variables, with mean $E\{Y_i\}$ as given by (6.8) and with
constant variance $\sigma^2$.

This general linear model encompasses a vast variety of situations. We consider a few 
of these now. 

$p - 1$ Predictor Variables. When $X_1, \ldots, X_{p-1}$ represent $p - 1$ different predictor
variables, general linear regression model (6.7) is, as we have seen, a first-order model in which
there are no interaction effects between the predictor variables. The example in Figure 6.1
involves a first-order model with two predictor variables.


Qualitative Predictor Variables. The general linear regression model (6.7) encompasses
not only quantitative predictor variables but also qualitative ones, such as gender (male,
female) or disability status (not disabled, partially disabled, fully disabled). We use indicator
variables that take on the values 0 and 1 to identify the classes of a qualitative variable.

Consider a regression analysis to predict the length of hospital stay ($Y$) based on the age
($X_1$) and gender ($X_2$) of the patient. We define $X_2$ as follows:

$$X_2 = \begin{cases} 1 & \text{if patient female} \\ 0 & \text{if patient male} \end{cases}$$

The first-order regression model then is as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{6.9}$$

where:

$$X_{i1} = \text{patient's age} \qquad X_{i2} = \begin{cases} 1 & \text{if patient female} \\ 0 & \text{if patient male} \end{cases}$$




The response function for regression model (6.9) is:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{6.10}$$

For male patients, $X_2 = 0$ and response function (6.10) becomes:

$$E\{Y\} = \beta_0 + \beta_1 X_1 \qquad \text{Male patients} \tag{6.10a}$$

For female patients, $X_2 = 1$ and response function (6.10) becomes:

$$E\{Y\} = (\beta_0 + \beta_2) + \beta_1 X_1 \qquad \text{Female patients} \tag{6.10b}$$


These two response functions represent parallel straight lines with different intercepts. 

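In general, we represent a qualitative variable with $c$ classes by means of $c - 1$ indicator
variables. For instance, if in the hospital stay example the qualitative variable disability
status is to be added as another predictor variable, it can be represented as follows by the
two indicator variables $X_3$ and $X_4$:

$$X_3 = \begin{cases} 1 & \text{if patient not disabled} \\ 0 & \text{otherwise} \end{cases}
\qquad
X_4 = \begin{cases} 1 & \text{if patient partially disabled} \\ 0 & \text{otherwise} \end{cases}$$

The first-order model with age, gender, and disability status as predictor variables then is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i \tag{6.11}$$

where:

$$X_{i1} = \text{patient's age}$$

$$X_{i2} = \begin{cases} 1 & \text{if patient female} \\ 0 & \text{if patient male} \end{cases}$$

$$X_{i3} = \begin{cases} 1 & \text{if patient not disabled} \\ 0 & \text{otherwise} \end{cases}$$

$$X_{i4} = \begin{cases} 1 & \text{if patient partially disabled} \\ 0 & \text{otherwise} \end{cases}$$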

In Chapter 8 we present a comprehensive discussion of how to model qualitative predictor 
variables and how to interpret regression models containing qualitative predictor variables. 
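
The indicator coding for model (6.11) can be sketched in a few lines. This is a hedged illustration only: the function name `design_row`, the class labels as strings, and the three patients are all made up for the example.

```python
def design_row(age, gender, disability):
    """Return [1, X1, X2, X3, X4] for one patient under model (6.11)."""
    if gender not in ("female", "male"):
        raise ValueError(f"unknown gender class: {gender!r}")
    classes = ("not disabled", "partially disabled", "fully disabled")
    if disability not in classes:
        raise ValueError(f"unknown disability class: {disability!r}")
    x2 = 1 if gender == "female" else 0
    x3 = 1 if disability == "not disabled" else 0
    x4 = 1 if disability == "partially disabled" else 0
    # "fully disabled" is the class with X3 = X4 = 0, so the three classes
    # need only c - 1 = 2 indicator variables.
    return [1, age, x2, x3, x4]


# Hypothetical patients, for illustration only.
rows = [
    design_row(34, "female", "not disabled"),
    design_row(58, "male", "partially disabled"),
    design_row(71, "female", "fully disabled"),
]
for row in rows:
    print(row)
```

The leading 1 in each row corresponds to the intercept term $\beta_0$.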

Polynomial Regression. Polynomial regression models are special cases of the general
linear regression model. They contain squared and higher-order terms of the predictor
variable(s), making the response function curvilinear. The following is a polynomial regression
model with one predictor variable:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i \tag{6.12}$$

Figure 1.3 on page 5 shows an example of a polynomial regression function with one
predictor variable.

Despite the curvilinear nature of the response function for regression model (6.12), it is
a special case of general linear regression model (6.7). If we let $X_{i1} = X_i$ and $X_{i2} = X_i^2$,
we can write (6.12) as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$

which is in the form of general linear regression model (6.7). While (6.12) illustrates a
curvilinear regression model where the response function is quadratic, models with higher-degree
polynomial response functions are also particular cases of the general linear regression
model. We shall discuss polynomial regression models in more detail in Chapter 8.
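
Treating $X_i$ and $X_i^2$ as two separate $X$ variables means that a quadratic model can be fitted with ordinary linear least squares. The following is a minimal sketch using NumPy, with made-up, noise-free data generated from hypothetical coefficients (1, 2, 3); it is an illustration, not a procedure taken from the text.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x**2  # noise-free, made-up responses

# Columns of the X matrix: X_i0 = 1, X_i1 = X_i, X_i2 = X_i^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Linear least squares on the transformed columns.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))
```

Because the data contain no error term, the fitted coefficients reproduce the hypothetical ones up to floating-point rounding.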

Transformed Variables. Models with transformed variables involve complex, curvilinear
response functions, yet still are special cases of the general linear regression model. Consider
the following model with a transformed $Y$ variable:

$$\log Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \tag{6.13}$$

Here, the response surface is complex, yet model (6.13) can still be treated as a general
linear regression model. If we let $Y_i' = \log Y_i$, we can write regression model (6.13) as
follows:

$$Y_i' = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i$$

which is in the form of general linear regression model (6.7). The response variable just
happens to be the logarithm of $Y$.

Many models can be transformed into the general linear regression model. For instance,
the model:

$$Y_i = \frac{1}{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i} \tag{6.14}$$

can be transformed to the general linear regression model by letting $Y_i' = 1/Y_i$. We then
have:

$$Y_i' = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$


Interaction Effects. When the effects of the predictor variables on the response variable
are not additive, the effect of one predictor variable depends on the levels of the other
predictor variables. The general linear regression model (6.7) encompasses regression models
with nonadditive or interacting effects. An example of a nonadditive regression model with
two predictor variables $X_1$ and $X_2$ is the following:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \varepsilon_i \tag{6.15}$$

Here, the response function is complex because of the interaction term $\beta_3 X_{i1} X_{i2}$. Yet
regression model (6.15) is a special case of the general linear regression model. Let $X_{i3} =
X_{i1} X_{i2}$ and then write (6.15) as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i$$

We see that this model is in the form of general linear regression model (6.7). We shall
discuss regression models with interaction effects in more detail in Chapter 8.
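
To see what nonadditivity means in practice, the sketch below evaluates the response function of a model of form (6.15) at hypothetical coefficient values ($\beta_0 = 10$, $\beta_1 = 2$, $\beta_2 = 5$, $\beta_3 = 0.5$; these numbers and the function names are assumptions chosen for illustration).

```python
def mean_response(x1, x2, b0=10.0, b1=2.0, b2=5.0, b3=0.5):
    """E{Y} for a model of form (6.15), with hypothetical coefficient values."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def unit_effect_of_x1(x1, x2):
    """Change in E{Y} when X1 rises by one unit with X2 held fixed."""
    return mean_response(x1 + 1, x2) - mean_response(x1, x2)


effect_at_x2_0 = unit_effect_of_x1(3, 0)  # equals b1
effect_at_x2_4 = unit_effect_of_x1(3, 4)  # equals b1 + b3 * 4
print(effect_at_x2_0, effect_at_x2_4)
```

Unlike the first-order model, the effect of a unit increase in $X_1$ here is $\beta_1 + \beta_3 X_2$, so it changes with the level of $X_2$.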

Combination of Cases. A regression model may combine several of the elements we have
just noted and still be treated as a general linear regression model. Consider the following
regression model containing linear and quadratic terms for each of two predictor variables
and an interaction term represented by the cross-product term:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \beta_3 X_{i2} + \beta_4 X_{i2}^2 + \beta_5 X_{i1} X_{i2} + \varepsilon_i \tag{6.16}$$

Let us define:

$$Z_{i1} = X_{i1} \qquad Z_{i2} = X_{i1}^2 \qquad Z_{i3} = X_{i2} \qquad Z_{i4} = X_{i2}^2 \qquad Z_{i5} = X_{i1} X_{i2}$$

We can then write regression model (6.16) as follows:

$$Y_i = \beta_0 + \beta_1 Z_{i1} + \beta_2 Z_{i2} + \beta_3 Z_{i3} + \beta_4 Z_{i4} + \beta_5 Z_{i5} + \varepsilon_i$$

which is in the form of general linear regression model (6.7).

FIGURE 6.2 Additional Examples of Response Functions.
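
The rewriting of model (6.16) in terms of new variables, taking $Z_{i1} = X_{i1}$, $Z_{i2} = X_{i1}^2$, $Z_{i3} = X_{i2}$, $Z_{i4} = X_{i2}^2$, and $Z_{i5} = X_{i1} X_{i2}$, can be sketched with NumPy as follows. The grid of predictor levels, the coefficient values in `beta_true`, and the helper name `z_matrix` are all hypothetical.

```python
import numpy as np

# A 3 x 3 grid of made-up predictor levels, flattened to 9 cases.
x1, x2 = [a.ravel() for a in np.meshgrid([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])]
beta_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical values


def z_matrix(x1, x2):
    """Columns 1, Z1, ..., Z5 for model (6.16)."""
    return np.column_stack([np.ones_like(x1), x1, x1**2, x2, x2**2, x1 * x2])


Z = z_matrix(x1, x2)
y = Z @ beta_true  # noise-free responses
b, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(np.round(b, 6))
```

Once the $Z$ columns are built, the fit is an ordinary linear least squares problem in six parameters.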

The general linear regression model (6.7) includes many complex models, some of which 
may be highly complex. Figure 6.2 illustrates two complex response surfaces when there 
are two predictor variables, that can be represented by general linear regression model (6.7). 

Meaning of Linear in General Linear Regression Model. It should be clear from the
various examples that general linear regression model (6.7) is not restricted to linear response
surfaces. The term linear model refers to the fact that model (6.7) is linear in the parameters;
it does not refer to the shape of the response surface.

We say that a regression model is linear in the parameters when it can be written in the
form:

$$Y_i = c_{i0}\beta_0 + c_{i1}\beta_1 + c_{i2}\beta_2 + \cdots + c_{i,p-1}\beta_{p-1} + \varepsilon_i \tag{6.17}$$

where the terms $c_{i0}$, $c_{i1}$, etc., are coefficients involving the predictor variables. For example,
first-order model (6.1) in two predictor variables:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$

is linear in the parameters, with $c_{i0} = 1$, $c_{i1} = X_{i1}$, and $c_{i2} = X_{i2}$.

An example of a nonlinear regression model is the following:

$$Y_i = \beta_0 \exp(\beta_1 X_i) + \varepsilon_i$$

This is a nonlinear regression model because it cannot be expressed in the form of (6.17).
We shall discuss nonlinear regression models in Part III.




6.2 General Linear Regression Model in Matrix Terms

We now present the principal results for the general linear regression model (6.7) in matrix 
terms. This model, as noted, encompasses a wide variety of particular cases. The results to 
be presented are applicable to all of these. 

It is a remarkable property of matrix algebra that the results for the general linear
regression model (6.7) in matrix notation appear exactly as those for the simple linear regression
model (5.57). Only the degrees of freedom and other constants related to the number of $X$
variables and the dimensions of some matrices are different. Hence, we are able to present
the results very concisely.

The matrix notation, to be sure, may hide enormous computational complexities. To find
the inverse of a 10 × 10 matrix $\mathbf{A}$ requires a tremendous amount of computation, yet it is
simply represented as $\mathbf{A}^{-1}$. Our reason for emphasizing matrix algebra is that it indicates
the essential conceptual steps in the solution. The actual computations will, in all but the
very simplest cases, be done by computer. Hence, it does not matter to us whether $(\mathbf{X}'\mathbf{X})^{-1}$
represents finding the inverse of a 2 × 2 or a 10 × 10 matrix. The important point is to know
what the inverse of the matrix represents.

To express general linear regression model (6.7): 


$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$


in matrix terms, we need to define the following matrices: 


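$$\underset{n \times 1}{\mathbf{Y}} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \tag{6.18a}$$

$$\underset{n \times p}{\mathbf{X}} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ 1 & X_{21} & X_{22} & \cdots & X_{2,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix} \tag{6.18b}$$

$$\underset{p \times 1}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix} \tag{6.18c}$$

$$\underset{n \times 1}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \tag{6.18d}$$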


Note that the $\mathbf{Y}$ and $\boldsymbol{\varepsilon}$ vectors are the same as for simple linear regression. The $\boldsymbol{\beta}$ vector
contains additional regression parameters, and the $\mathbf{X}$ matrix contains a column of 1s as well
as a column of the $n$ observations for each of the $p - 1$ $X$ variables in the regression model.
The row subscript for each element $X_{ik}$ in the $\mathbf{X}$ matrix identifies the trial or case, and the
column subscript identifies the $X$ variable.
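
The layout of the $\mathbf{X}$ matrix can be illustrated with a short NumPy sketch. The predictor values below are made up, and `np.column_stack` is just one convenient way to assemble the columns.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # made-up values of X1
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])  # made-up values of X2

# Row i is (1, X_i1, X_i2); column k holds the n observations on X_k.
X = np.column_stack([np.ones(len(x1)), x1, x2])
n, p = X.shape
print(X)
print(n, p)
```

Here $n$ is the number of cases and $p$ counts the column of 1s plus the $p - 1$ predictor columns.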

In matrix terms, the general linear regression model (6.7) is: 


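$$\underset{n \times 1}{\mathbf{Y}} = \underset{n \times p}{\mathbf{X}}\;\underset{p \times 1}{\boldsymbol{\beta}} + \underset{n \times 1}{\boldsymbol{\varepsilon}} \tag{6.19}$$

where:

$\mathbf{Y}$ is a vector of responses
$\boldsymbol{\beta}$ is a vector of parameters
$\mathbf{X}$ is a matrix of constants
$\boldsymbol{\varepsilon}$ is a vector of independent normal random variables with expectation
$E\{\boldsymbol{\varepsilon}\} = \mathbf{0}$ and variance-covariance matrix:

$$\underset{n \times n}{\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\}} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \mathbf{I}$$

Consequently, the random vector $\mathbf{Y}$ has expectation:

$$\underset{n \times 1}{E\{\mathbf{Y}\}} = \mathbf{X}\boldsymbol{\beta} \tag{6.20}$$

and the variance-covariance matrix of $\mathbf{Y}$ is the same as that of $\boldsymbol{\varepsilon}$:

$$\underset{n \times n}{\boldsymbol{\sigma}^2\{\mathbf{Y}\}} = \sigma^2 \mathbf{I} \tag{6.21}$$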

6.3 Estimation of Regression Coefficients



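The least squares criterion (1.8) is generalized as follows for general linear regression
model (6.7):

$$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 \tag{6.22}$$

The least squares estimators are those values of $\beta_0, \beta_1, \ldots, \beta_{p-1}$ that minimize $Q$. Let us
denote the vector of the least squares estimated regression coefficients $b_0, b_1, \ldots, b_{p-1}$ as $\mathbf{b}$:

$$\underset{p \times 1}{\mathbf{b}} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{bmatrix} \tag{6.23}$$

The least squares normal equations for the general linear regression model (6.19) are:

$$\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{X}'\mathbf{Y} \tag{6.24}$$

and the least squares estimators are:

$$\underset{p \times 1}{\mathbf{b}} = \underset{p \times p}{(\mathbf{X}'\mathbf{X})^{-1}}\;\underset{p \times 1}{(\mathbf{X}'\mathbf{Y})} \tag{6.25}$$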
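
As a minimal sketch of (6.24) and (6.25), the normal equations can be solved numerically with NumPy. The data are made up for illustration; `np.linalg.solve` is used on the system $\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{X}'\mathbf{Y}$ rather than forming the inverse explicitly, which is a common numerical choice and not something the text prescribes.

```python
import numpy as np

# Made-up data: n = 6 cases, two predictor variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([22.5, 18.6, 36.2, 32.9, 50.4, 46.8])
X = np.column_stack([np.ones_like(x1), x1, x2])

XtX = X.T @ X                  # X'X, a p x p matrix
XtY = X.T @ Y                  # X'Y, a p x 1 vector
b = np.linalg.solve(XtX, XtY)  # solves the normal equations X'X b = X'Y

# Cross-check against NumPy's general least squares routine.
b_check, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)
```

Both routes give the same estimates whenever $\mathbf{X}'\mathbf{X}$ is nonsingular, as it is for these data.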




The method of maximum likelihood leads to the same estimators for normal error
regression model (6.19) as those obtained by the method of least squares in (6.25). The likelihood
function in (1.26) generalizes directly for multiple regression as follows:

$$L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2\right] \tag{6.26}$$

Maximizing this likelihood function with respect to $\beta_0, \beta_1, \ldots, \beta_{p-1}$ leads to the estimators
in (6.25). These estimators are least squares and maximum likelihood estimators and have
all the properties mentioned in Chapter 1: they are minimum variance unbiased, consistent,
and sufficient.

6.4 Fitted Values and Residuals

Let the vector of the fitted values $\hat{Y}_i$ be denoted by $\hat{\mathbf{Y}}$ and the vector of the residual terms
$e_i = Y_i - \hat{Y}_i$ be denoted by $\mathbf{e}$:


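$$\underset{n \times 1}{\hat{\mathbf{Y}}} = \begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{bmatrix} \tag{6.27a}$$

$$\underset{n \times 1}{\mathbf{e}} = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{6.27b}$$

The fitted values are represented by:

$$\underset{n \times 1}{\hat{\mathbf{Y}}} = \mathbf{X}\mathbf{b} \tag{6.28}$$

and the residual terms by:

$$\underset{n \times 1}{\mathbf{e}} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\mathbf{b} \tag{6.29}$$

The vector of the fitted values $\hat{\mathbf{Y}}$ can be expressed in terms of the hat matrix $\mathbf{H}$ as follows:

$$\underset{n \times 1}{\hat{\mathbf{Y}}} = \mathbf{H}\mathbf{Y} \tag{6.30}$$

where:

$$\underset{n \times n}{\mathbf{H}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \tag{6.30a}$$

Similarly, the vector of residuals can be expressed as follows:

$$\underset{n \times 1}{\mathbf{e}} = (\mathbf{I} - \mathbf{H})\mathbf{Y} \tag{6.31}$$

The variance-covariance matrix of the residuals is:

$$\underset{n \times n}{\boldsymbol{\sigma}^2\{\mathbf{e}\}} = \sigma^2(\mathbf{I} - \mathbf{H}) \tag{6.32}$$

which is estimated by:

$$\underset{n \times n}{\mathbf{s}^2\{\mathbf{e}\}} = MSE(\mathbf{I} - \mathbf{H}) \tag{6.33}$$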
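
The hat matrix relations (6.30), (6.30a), and (6.31) can be checked numerically. The sketch below uses made-up data and forms $\mathbf{H}$ explicitly, which is reasonable for a small illustration but is an assumption of this example rather than a recommended computing method.

```python
import numpy as np

# Made-up data: n = 6 cases, two predictor variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([22.5, 18.6, 36.2, 32.9, 50.4, 46.8])
X = np.column_stack([np.ones_like(x1), x1, x2])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix, as in (6.30a)
Y_hat = H @ Y                         # fitted values, as in (6.30)
e = (np.eye(len(Y)) - H) @ Y          # residuals, as in (6.31)

print(np.round(Y_hat, 4))
print(np.round(e, 4))
```

Useful sanity checks are that $\mathbf{H}$ is symmetric and idempotent, that its trace equals $p$, and that $\mathbf{X}'\mathbf{e} = \mathbf{0}$.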


6.5 Analysis of Variance Results 


Sums of Squares and Mean Squares 

The sums of squares for the analysis of variance in matrix terms are, from (5.89): 


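$$SSTO = \mathbf{Y}'\mathbf{Y} - \left(\frac{1}{n}\right)\mathbf{Y}'\mathbf{J}\mathbf{Y} = \mathbf{Y}'\left[\mathbf{I} - \left(\frac{1}{n}\right)\mathbf{J}\right]\mathbf{Y} \tag{6.34}$$

$$SSE = \mathbf{e}'\mathbf{e} = (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) = \mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y} = \mathbf{Y}'(\mathbf{I} - \mathbf{H})\mathbf{Y} \tag{6.35}$$

$$SSR = \mathbf{b}'\mathbf{X}'\mathbf{Y} - \left(\frac{1}{n}\right)\mathbf{Y}'\mathbf{J}\mathbf{Y} = \mathbf{Y}'\left[\mathbf{H} - \left(\frac{1}{n}\right)\mathbf{J}\right]\mathbf{Y} \tag{6.36}$$

where $\mathbf{J}$ is an $n \times n$ matrix of 1s defined in (5.18) and $\mathbf{H}$ is the hat matrix defined in (6.30a).

$SSTO$, as usual, has $n - 1$ degrees of freedom associated with it. $SSE$ has $n - p$ degrees
of freedom associated with it since $p$ parameters need to be estimated in the regression
function for model (6.19). Finally, $SSR$ has $p - 1$ degrees of freedom associated with it,
representing the number of $X$ variables $X_1, \ldots, X_{p-1}$.

Table 6.1 shows these analysis of variance results, as well as the mean squares $MSR$ and
$MSE$:

$$MSR = \frac{SSR}{p - 1} \tag{6.37}$$

$$MSE = \frac{SSE}{n - p} \tag{6.38}$$

The expectation of $MSE$ is $\sigma^2$, as for simple linear regression. The expectation of $MSR$
is $\sigma^2$ plus a quantity that is nonnegative. For instance, when $p - 1 = 2$, we have:

$$E\{MSR\} = \sigma^2 + \frac{1}{2}\left[\beta_1^2 \sum (X_{i1} - \bar{X}_1)^2 + \beta_2^2 \sum (X_{i2} - \bar{X}_2)^2 + 2\beta_1\beta_2 \sum (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)\right]$$

Note that if both $\beta_1$ and $\beta_2$ equal zero, $E\{MSR\} = \sigma^2$. Otherwise $E\{MSR\} > \sigma^2$.

TABLE 6.1 ANOVA Table for General Linear Regression Model (6.19).

Source of Variation   SS                                                                           df        MS
Regression            $SSR = \mathbf{b}'\mathbf{X}'\mathbf{Y} - (1/n)\mathbf{Y}'\mathbf{J}\mathbf{Y}$    $p - 1$   $MSR = SSR/(p - 1)$
Error                 $SSE = \mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y}$                   $n - p$   $MSE = SSE/(n - p)$
Total                 $SSTO = \mathbf{Y}'\mathbf{Y} - (1/n)\mathbf{Y}'\mathbf{J}\mathbf{Y}$              $n - 1$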
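
The matrix expressions for the sums of squares and mean squares can be evaluated directly. The following NumPy sketch uses made-up data and follows the formulas for $SSTO$, $SSE$, and $SSR$ literally; variable names such as `correction` and `F_star` are chosen for this illustration.

```python
import numpy as np

# Made-up data: n = 6 cases, two predictor variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([22.5, 18.6, 36.2, 32.9, 50.4, 46.8])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

b = np.linalg.solve(X.T @ X, X.T @ Y)
J = np.ones((n, n))             # n x n matrix of 1s
correction = (Y @ J @ Y) / n    # (1/n) Y'JY

SSTO = Y @ Y - correction       # total sum of squares
SSE = Y @ Y - b @ X.T @ Y       # error sum of squares
SSR = b @ X.T @ Y - correction  # regression sum of squares

MSR = SSR / (p - 1)
MSE = SSE / (n - p)
F_star = MSR / MSE              # ratio of the two mean squares
print(SSTO, SSR, SSE)
print(MSR, MSE, F_star)
```

The decomposition $SSTO = SSR + SSE$ holds for any data set fitted with an intercept, which gives a quick check on the arithmetic.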






F Test for Regression Relation 

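To test whether there is a regression relation between the response variable $Y$ and the set of
$X$ variables $X_1, \ldots, X_{p-1}$, i.e., to choose between the alternatives:

$$\begin{aligned} H_0 &: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \\ H_a &: \text{not all } \beta_k\ (k = 1, \ldots, p - 1) \text{ equal zero} \end{aligned} \tag{6.39a}$$

we use the test statistic:

$$F^* = \frac{MSR}{MSE} \tag{6.39b}$$

The decision rule to control the Type I error at $\alpha$ is:

$$\begin{aligned} &\text{If } F^* \le F(1 - \alpha; p - 1, n - p), \text{ conclude } H_0 \\ &\text{If } F^* > F(1 - \alpha; p - 1, n - p), \text{ conclude } H_a \end{aligned} \tag{6.39c}$$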

The existence of a regression relation by itself does not, of course, ensure that useful
predictions can be made by using it.

Note that when $p - 1 = 1$, this test reduces to the $F$ test in (2.60) for testing in simple
linear regression whether or not $\beta_1 = 0$.


Coefficient of Multiple Determination 

The coefficient of multiple determination, denoted by $R^2$, is defined as follows:

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \tag{6.40}$$

It measures the proportionate reduction of total variation in $Y$ associated with the use of the
set of $X$ variables $X_1, \ldots, X_{p-1}$. The coefficient of multiple determination $R^2$ reduces to the
coefficient of simple determination in (2.72) for simple linear regression when $p - 1 = 1$,
i.e., when one $X$ variable is in regression model (6.19). Just as before, we have:

$$0 \le R^2 \le 1 \tag{6.41}$$

where $R^2$ assumes the value 0 when all $b_k = 0$ ($k = 1, \ldots, p - 1$), and the value 1 when
all $Y$ observations fall directly on the fitted regression surface, i.e., when $Y_i = \hat{Y}_i$ for all $i$.

Adding more $X$ variables to the regression model can only increase $R^2$ and never reduce
it, because $SSE$ can never become larger with more $X$ variables and $SSTO$ is always the
same for a given set of responses. Since $R^2$ usually can be made larger by including a larger
number of predictor variables, it is sometimes suggested that a modified measure be used
that adjusts for the number of $X$ variables in the model. The adjusted coefficient of multiple
determination, denoted by $R_a^2$, adjusts $R^2$ by dividing each sum of squares by its associated
degrees of freedom:

$$R_a^2 = 1 - \frac{SSE/(n - p)}{SSTO/(n - 1)} = 1 - \left(\frac{n - 1}{n - p}\right)\frac{SSE}{SSTO} \tag{6.42}$$

This adjusted coefficient of multiple determination may actually become smaller when
another $X$ variable is introduced into the model, because any decrease in $SSE$ may be more
than offset by the loss of a degree of freedom in the denominator $n - p$.
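
A small numerical illustration of (6.40) and (6.42) follows. The sums of squares and sample size are invented purely to show how $R^2$ can rise while $R_a^2$ falls when an $X$ variable that reduces $SSE$ only slightly is added; the function names are likewise assumptions.

```python
def r_squared(sse, ssto):
    """Coefficient of multiple determination, R^2 = 1 - SSE/SSTO."""
    return 1.0 - sse / ssto


def adjusted_r_squared(sse, ssto, n, p):
    """Adjusted coefficient, R_a^2 = 1 - ((n - 1)/(n - p)) * SSE/SSTO."""
    return 1.0 - ((n - 1) / (n - p)) * (sse / ssto)


# Hypothetical fits to the same n = 12 responses (SSTO = 100):
# a model with p = 3 parameters, then one with p = 4 parameters.
r2_small = r_squared(20.0, 100.0)
r2_big = r_squared(19.9, 100.0)
adj_small = adjusted_r_squared(20.0, 100.0, n=12, p=3)
adj_big = adjusted_r_squared(19.9, 100.0, n=12, p=4)

print(r2_small, r2_big)
print(adj_small, adj_big)
```

In this made-up case the extra variable lowers $SSE$ from 20.0 to 19.9, so $R^2$ increases slightly, but the factor $(n - 1)/(n - p)$ grows from 11/9 to 11/8 and $R_a^2$ decreases.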



Comments

1. To distinguish between the coefficients of determination for simple and multiple regression,
we shall from now on refer to the former as the coefficient of simple determination.

2. It can be shown that the coefficient of multiple determination $R^2$ can be viewed as a coefficient
of simple determination between the responses $Y_i$ and the fitted values $\hat{Y}_i$.

3. A large value of $R^2$ does not necessarily imply that the fitted model is a useful one. For instance,
observations may have been taken at only a few levels of the predictor variables. Despite a high $R^2$
in this case, the fitted model may not be useful if most predictions require extrapolations outside the
region of observations. Again, even though $R^2$ is large, $MSE$ may still be too large for inferences to
be useful when high precision is required.

Coefficient of Multiple Correlation 

The coefficient of multiple correlation $R$ is the positive square root of $R^2$:

$$R = \sqrt{R^2} \tag{6.43}$$

When there is one $X$ variable in regression model (6.19), i.e., when $p - 1 = 1$, the coefficient
of multiple correlation $R$ equals in absolute value the correlation coefficient $r$ in (2.73) for
simple correlation.

6.6 Inferences about Regression Parameters


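The least squares and maximum likelihood estimators in $\mathbf{b}$ are unbiased:

$$E\{\mathbf{b}\} = \boldsymbol{\beta} \tag{6.44}$$

The variance-covariance matrix $\boldsymbol{\sigma}^2\{\mathbf{b}\}$:

$$\underset{p \times p}{\boldsymbol{\sigma}^2\{\mathbf{b}\}} = \begin{bmatrix} \sigma^2\{b_0\} & \sigma\{b_0, b_1\} & \cdots & \sigma\{b_0, b_{p-1}\} \\ \sigma\{b_1, b_0\} & \sigma^2\{b_1\} & \cdots & \sigma\{b_1, b_{p-1}\} \\ \vdots & \vdots & & \vdots \\ \sigma\{b_{p-1}, b_0\} & \sigma\{b_{p-1}, b_1\} & \cdots & \sigma^2\{b_{p-1}\} \end{bmatrix} \tag{6.45}$$

is given by:

$$\underset{p \times p}{\boldsymbol{\sigma}^2\{\mathbf{b}\}} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \tag{6.46}$$

The estimated variance-covariance matrix $\mathbf{s}^2\{\mathbf{b}\}$:

$$\underset{p \times p}{\mathbf{s}^2\{\mathbf{b}\}} = \begin{bmatrix} s^2\{b_0\} & s\{b_0, b_1\} & \cdots & s\{b_0, b_{p-1}\} \\ s\{b_1, b_0\} & s^2\{b_1\} & \cdots & s\{b_1, b_{p-1}\} \\ \vdots & \vdots & & \vdots \\ s\{b_{p-1}, b_0\} & s\{b_{p-1}, b_1\} & \cdots & s^2\{b_{p-1}\} \end{bmatrix} \tag{6.47}$$

is given by:

$$\underset{p \times p}{\mathbf{s}^2\{\mathbf{b}\}} = MSE(\mathbf{X}'\mathbf{X})^{-1} \tag{6.48}$$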





From $\mathbf{s}^2\{\mathbf{b}\}$, one can obtain $s^2\{b_0\}$, $s^2\{b_1\}$, or whatever other variance is needed, or any
needed covariances.


Interval Estimation of $\beta_k$

For the normal error regression model (6.19), we have:

$$\frac{b_k - \beta_k}{s\{b_k\}} \sim t(n - p) \qquad k = 0, 1, \ldots, p - 1 \tag{6.49}$$

Hence, the confidence limits for $\beta_k$ with $1 - \alpha$ confidence coefficient are:

$$b_k \pm t(1 - \alpha/2; n - p)\, s\{b_k\} \tag{6.50}$$


Tests for $\beta_k$

Tests for $\beta_k$ are set up in the usual fashion. To test:

$$\begin{aligned} H_0 &: \beta_k = 0 \\ H_a &: \beta_k \ne 0 \end{aligned} \tag{6.51a}$$

we may use the test statistic:

$$t^* = \frac{b_k}{s\{b_k\}} \tag{6.51b}$$

and the decision rule:

$$\begin{aligned} &\text{If } |t^*| \le t(1 - \alpha/2; n - p), \text{ conclude } H_0 \\ &\text{Otherwise conclude } H_a \end{aligned} \tag{6.51c}$$

The power of the $t$ test can be obtained as explained in Chapter 2, with the degrees of
freedom modified to $n - p$.

As with simple linear regression, an $F$ test can also be conducted to determine whether
or not $\beta_k = 0$ in multiple regression models. We discuss this test in Chapter 7.
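
The estimated variance-covariance matrix $MSE(\mathbf{X}'\mathbf{X})^{-1}$, the estimated standard deviations $s\{b_k\}$, and the statistics $t^* = b_k / s\{b_k\}$ can be computed as in the sketch below. The data are made up, and the critical value $t(1 - \alpha/2; n - p)$ is deliberately left out: it would come from a table of $t$ percentiles or a statistics library.

```python
import numpy as np

# Made-up data: n = 6 cases, two predictor variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([22.5, 18.6, 36.2, 32.9, 50.4, 46.8])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b
MSE = resid @ resid / (n - p)

s2_b = MSE * XtX_inv          # estimated variance-covariance matrix of b
s_b = np.sqrt(np.diag(s2_b))  # estimated standard deviations s{b_k}
t_star = b / s_b              # test statistic t* for each k

print(np.round(s2_b, 5))
print(np.round(t_star, 3))
```

Each $|t^*|$ would then be compared with the $t$ percentile for $n - p$ degrees of freedom, and the same `s_b` values give the half-widths of the confidence limits once multiplied by that percentile.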


Joint Inferences 

The Bonferroni joint confidence intervals can be used to estimate several regression
coefficients simultaneously. If $g$ parameters are to be estimated jointly (where $g \le p$), the
confidence limits with family confidence coefficient $1 - \alpha$ are:

$$b_k \pm B s\{b_k\} \tag{6.52}$$

where:

$$B = t(1 - \alpha/2g; n - p) \tag{6.52a}$$

In Chapter 7, we discuss tests concerning subsets of the regression parameters. 





6.7 Estimation of Mean Response and Prediction of New Observation

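Interval Estimation of $E\{Y_h\}$

For given values of $X_1, \ldots, X_{p-1}$, denoted by $X_{h1}, \ldots, X_{h,p-1}$, the mean response is
denoted by $E\{Y_h\}$. We define the vector $\mathbf{X}_h$:

$$\underset{p \times 1}{\mathbf{X}_h} = \begin{bmatrix} 1 \\ X_{h1} \\ \vdots \\ X_{h,p-1} \end{bmatrix} \tag{6.53}$$

so that the mean response to be estimated is:

$$E\{Y_h\} = \mathbf{X}_h'\boldsymbol{\beta} \tag{6.54}$$

The estimated mean response corresponding to $\mathbf{X}_h$, denoted by $\hat{Y}_h$, is:

$$\hat{Y}_h = \mathbf{X}_h'\mathbf{b} \tag{6.55}$$

This estimator is unbiased:

$$E\{\hat{Y}_h\} = \mathbf{X}_h'\boldsymbol{\beta} = E\{Y_h\} \tag{6.56}$$

and its variance is:

$$\sigma^2\{\hat{Y}_h\} = \sigma^2 \mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h \tag{6.57}$$

This variance can be expressed as a function of the variance-covariance matrix of the
estimated regression coefficients:

$$\sigma^2\{\hat{Y}_h\} = \mathbf{X}_h'\boldsymbol{\sigma}^2\{\mathbf{b}\}\mathbf{X}_h \tag{6.57a}$$

Note from (6.57a) that the variance $\sigma^2\{\hat{Y}_h\}$ is a function of the variances $\sigma^2\{b_k\}$ of the
regression coefficients and of the covariances $\sigma\{b_k, b_{k'}\}$ between pairs of regression coefficients,
just as in simple linear regression. The estimated variance $s^2\{\hat{Y}_h\}$ is given by:

$$s^2\{\hat{Y}_h\} = MSE(\mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h) = \mathbf{X}_h'\mathbf{s}^2\{\mathbf{b}\}\mathbf{X}_h \tag{6.58}$$

The $1 - \alpha$ confidence limits for $E\{Y_h\}$ are:

$$\hat{Y}_h \pm t(1 - \alpha/2; n - p)\, s\{\hat{Y}_h\} \tag{6.59}$$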
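
The point estimate $\mathbf{X}_h'\mathbf{b}$ and its estimated variance can be computed in either of the two equivalent forms, $MSE(\mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h)$ or $\mathbf{X}_h'\mathbf{s}^2\{\mathbf{b}\}\mathbf{X}_h$. The sketch below uses made-up data and a made-up vector `X_h`; the $t$ percentile needed for the confidence limits is not computed here.

```python
import numpy as np

# Made-up data: n = 6 cases, two predictor variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([22.5, 18.6, 36.2, 32.9, 50.4, 46.8])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b
MSE = resid @ resid / (n - p)

X_h = np.array([1.0, 3.5, 3.0])           # [1, X_h1, X_h2], a made-up point
Y_hat_h = X_h @ b                         # estimated mean response
s2_Y_hat_h = MSE * (X_h @ XtX_inv @ X_h)  # first form of the estimated variance
s2_alt = X_h @ (MSE * XtX_inv) @ X_h      # second form, via s^2{b}

print(Y_hat_h, s2_Y_hat_h, s2_alt)
```

The two variance expressions agree up to rounding, since $\mathbf{s}^2\{\mathbf{b}\} = MSE(\mathbf{X}'\mathbf{X})^{-1}$.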


Confidence Region for Regression Surface

The $1 - \alpha$ confidence region for the entire regression surface is an extension of the
Working-Hotelling confidence band (2.40) for the regression line when there is one predictor variable.
Boundary points of the confidence region at $\mathbf{X}_h$ are obtained from:

$$\hat{Y}_h \pm W s\{\hat{Y}_h\} \tag{6.60}$$

where:

$$W^2 = pF(1 - \alpha; p, n - p) \tag{6.60a}$$

The confidence coefficient 1 — a provides assurance that the region contains the entire 
regression surface over all combinations of values of the X variables. 

Simultaneous Confidence Intervals for Several Mean Responses 

To estimate a number of mean responses $E\{Y_h\}$ corresponding to different $\mathbf{X}_h$ vectors with
family confidence coefficient $1 - \alpha$, we can employ two basic approaches:

1. Use the Working-Hotelling confidence region bounds (6.60) for the several $\mathbf{X}_h$ vectors
of interest:

$$\hat{Y}_h \pm W s\{\hat{Y}_h\} \tag{6.61}$$

where $\hat{Y}_h$, $W$, and $s\{\hat{Y}_h\}$ are defined in (6.55), (6.60a), and (6.58), respectively. Since the
Working-Hotelling confidence region covers the mean responses for all possible $\mathbf{X}_h$
vectors with confidence coefficient $1 - \alpha$, the selected boundary values will cover the mean
responses for the $\mathbf{X}_h$ vectors of interest with family confidence coefficient greater than $1 - \alpha$.

2. Use Bonferroni simultaneous confidence intervals. When $g$ interval estimates are to
be made, the Bonferroni confidence limits are:

$$\hat{Y}_h \pm B s\{\hat{Y}_h\} \tag{6.62}$$

where:

$$B = t(1 - \alpha/2g; n - p) \tag{6.62a}$$

For any particular application, we can compare the $W$ and $B$ multiples to see which
procedure will lead to narrower confidence intervals. If the $\mathbf{X}_h$ levels are not specified in
advance but are determined as the analysis proceeds, it is better to use the Working-Hotelling
limits (6.61) since the family for this procedure includes all possible $\mathbf{X}_h$ levels.

Prediction of New Observation $Y_{h(\text{new})}$

The $1 - \alpha$ prediction limits for a new observation $Y_{h(\text{new})}$ corresponding to $\mathbf{X}_h$, the specified
values of the $X$ variables, are:

$$\hat{Y}_h \pm t(1 - \alpha/2; n - p)\, s\{\text{pred}\} \tag{6.63}$$

where:

$$s^2\{\text{pred}\} = MSE + s^2\{\hat{Y}_h\} = MSE(1 + \mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h) \tag{6.63a}$$

and $s^2\{\hat{Y}_h\}$ is given by (6.58).

Prediction of Mean of $m$ New Observations at $\mathbf{X}_h$

When $m$ new observations are to be selected at the same levels $\mathbf{X}_h$ and their mean $\bar{Y}_{h(\text{new})}$ is
to be predicted, the $1 - \alpha$ prediction limits are:

$$\hat{Y}_h \pm t(1 - \alpha/2; n - p)\, s\{\text{predmean}\} \tag{6.64}$$

where:

$$s^2\{\text{predmean}\} = \frac{MSE}{m} + s^2\{\hat{Y}_h\} = MSE\left(\frac{1}{m} + \mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h\right) \tag{6.64a}$$
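
The two prediction variances differ only in how the $MSE$ term is scaled: it enters in full for a single new observation and is divided by $m$ for the mean of $m$ new observations. A small sketch with invented values ($MSE = 4.0$, $s^2\{\hat{Y}_h\} = 1.0$) makes the comparison concrete; the function names are assumptions for this example.

```python
def s2_pred(mse, s2_yhat_h):
    """Estimated variance for predicting one new observation: MSE + s^2{Yhat_h}."""
    return mse + s2_yhat_h


def s2_predmean(mse, s2_yhat_h, m):
    """Estimated variance for predicting the mean of m new observations."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return mse / m + s2_yhat_h


# Hypothetical values of MSE and s^2{Yhat_h}, for illustration only.
one_new = s2_pred(4.0, 1.0)
mean_of_four = s2_predmean(4.0, 1.0, m=4)
print(one_new, mean_of_four)
```

With $m = 1$ the second function reduces to the first, and as $m$ grows the result approaches $s^2\{\hat{Y}_h\}$.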

Predictions of g New Observations 

Simultaneous Scheffé prediction limits for $g$ new observations at $g$ different levels $\mathbf{X}_h$ with
family confidence coefficient $1 - \alpha$ are given by:

$$\hat{Y}_h \pm S s\{\text{pred}\} \tag{6.65}$$

where:

$$S^2 = gF(1 - \alpha; g, n - p) \tag{6.65a}$$

and $s^2\{\text{pred}\}$ is given by (6.63a).

Alternatively, Bonferroni simultaneous prediction limits can be used. For $g$ predictions
with family confidence coefficient $1 - \alpha$, they are:

$$\hat{Y}_h \pm B s\{\text{pred}\} \tag{6.66}$$

where:

$$B = t(1 - \alpha/2g; n - p) \tag{6.66a}$$

A comparison of $S$ and $B$ in advance of any particular use will indicate which procedure
will lead to narrower prediction intervals.

Caution about Hidden Extrapolations 

When estimating a mean response or predicting a new observation in multiple regression, 
one needs to be particularly careful that the estimate or prediction does not fall outside the 
scope of the model. The danger, of course, is that the model may not be appropriate when it 
is extended outside the region of the observations. In multiple regression, it is particularly 
easy to lose track of this region since the levels of $X_1, \ldots, X_{p-1}$ jointly define the region.
Thus, one cannot merely look at the ranges of each predictor variable. Consider Figure 6.3,
where the shaded region is the region of observations for a multiple regression application
with two predictor variables and the circled dot represents the values $(X_{h1}, X_{h2})$ for which
a prediction is to be made. The circled dot is within the ranges of the predictor variables
$X_1$ and $X_2$ individually, yet is well outside the joint region of observations. It is easy to
spot this extrapolation when there are only two predictor variables, but it becomes much
more difficult when the number of predictor variables is large. We discuss in Chapter 10
a procedure for identifying hidden extrapolations when there are more than two predictor
variables.

FIGURE 6.3 Region of Observations on $X_1$ and $X_2$ Jointly, Compared with Ranges of $X_1$ and $X_2$ Individually.
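
One commonly used numerical check for a hidden extrapolation compares the quantity $\mathbf{X}_{\text{new}}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{\text{new}}$ for the new point with the diagonal elements of the hat matrix for the observed cases. Whether this is exactly the procedure of Chapter 10 is an assumption here; the data, the new point, and the variable names below are made up for illustration.

```python
import numpy as np

# Made-up data in which X2 moves closely with X1, so the joint region of
# observations is a narrow band rather than the full rectangle of ranges.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([1.2, 1.9, 3.1, 4.0, 4.8, 6.1, 7.0, 8.1])
X = np.column_stack([np.ones_like(x1), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)

# Diagonal elements of the hat matrix for the observed cases.
leverages = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# New point: X1 = 7 and X2 = 2 are each inside the observed ranges.
X_new = np.array([1.0, 7.0, 2.0])
h_new = X_new @ XtX_inv @ X_new

inside_ranges = (x1.min() <= X_new[1] <= x1.max()) and (x2.min() <= X_new[2] <= x2.max())
flagged = h_new > leverages.max()  # larger than any observed case
print(inside_ranges, flagged)
```

Here the new point passes the range check on each variable separately but is flagged by the joint check, which mirrors the situation of the circled dot described for Figure 6.3.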


6.8 Diagnostics and Remedial Measures 

Diagnostics play an important role in the development and evaluation of multiple regression 
models. Most of the diagnostic procedures for simple linear regression that we described in 
Chapter 3 carry over directly to multiple regression. We review these diagnostic procedures 
now, as well as the remedial measures for simple linear regression that carry over directly 
to multiple regression. 

Many specialized diagnostics and remedial procedures for multiple regression have also 
been developed. Some important ones will be discussed in Chapters 10 and 11. 

Scatter Plot Matrix 

Box plots, sequence plots, stem-and-leaf plots, and dot plots for each of the predictor
variables and for the response variable can provide helpful, preliminary univariate information
about these variables. Scatter plots of the response variable against each predictor variable 
can aid in determining the nature and strength of the bivariate relationships between each of 
the predictor variables and the response variable and in identifying gaps in the data points as 
well as outlying data points. Scatter plots of each predictor variable against each of the other 
predictor variables are helpful for studying the bivariate relationships among the predictor 
variables and for finding gaps and detecting outliers. 

Analysis is facilitated if these scatter plots are assembled in a scatter plot matrix, such 
as in Figure 6.4. In this figure, the Y variable for any one scatter plot is the name found in 


FIGURE 6.4 SYGRAPH Scatter Plot Matrix and Correlation Matrix: Dwaine Studios Example.

(a) Scatter Plot Matrix [pairwise scatter plots of SALES, TARGTPOP, and DISPOINC]

(b) Correlation Matrix

            SALES    TARGTPOP   DISPOINC
SALES       1.000      .945       .836
TARGTPOP              1.000       .781
DISPOINC                         1.000


Chapter 6 Multiple Regression I 233 


its row, and the $X$ variable is the name found in its column. Thus, the scatter plot matrix in
Figure 6.4 shows in the first row the plots of $Y$ (SALES) against $X_1$ (TARGTPOP) and
$X_2$ (DISPOINC), of $X_1$ against $Y$ and $X_2$ in the second row, and of $X_2$ against $Y$ and $X_1$
in the third row. These variables are described on page 236. Alternatively, by viewing the
first column, one can compare the plots of $X_1$ and $X_2$ each against $Y$, and similarly for the
other two columns. A scatter plot matrix facilitates the study of the relationships among 
the variables by comparing the scatter plots within a row or a column. Examples in this and 
subsequent chapters will illustrate the usefulness of scatter plot matrices. 

A complement to the scatter plot matrix that may be useful at times is the correlation matrix.
This matrix contains the coefficients of simple correlation $r_{Y1}, r_{Y2}, \ldots, r_{Y,p-1}$ between
$Y$ and each of the predictor variables, as well as all of the coefficients of simple correlation
among the predictor variables: $r_{12}$ between $X_1$ and $X_2$, $r_{13}$ between $X_1$ and $X_3$, etc. The
format of the correlation matrix follows that of the scatter plot matrix:






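$$
\begin{bmatrix}
1 & r_{Y1} & r_{Y2} & \cdots & r_{Y,p-1} \\
r_{Y1} & 1 & r_{12} & \cdots & r_{1,p-1} \\
\vdots & \vdots & \vdots & & \vdots \\
r_{Y,p-1} & r_{1,p-1} & r_{2,p-1} & \cdots & 1
\end{bmatrix}
\tag{6.67}
$$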



Note that the correlation matrix is symmetric and that its main diagonal contains 1s because
the coefficient of correlation between a variable and itself is 1. Many statistics packages 
provide the correlation matrix as an option. Since this matrix is symmetric, the lower (or 
upper) triangular block of elements is frequently omitted in the output. 
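
As a rough illustration, the correlation matrix reported in Figure 6.4b can be recomputed from the Dwaine Studios data listed in Figure 6.5b. The sketch below assumes NumPy is available; the variable names `x1`, `x2`, `y`, and `corr` are our own labels, not part of any package output.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b): X1 = TARGTPOP, X2 = DISPOINC, Y = SALES.
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])

# np.corrcoef treats each row as one variable. Ordering the rows Y, X1, X2
# gives the same layout as the correlation matrix (6.67).
corr = np.corrcoef(np.vstack([y, x1, x2]))

# The matrix is symmetric, so printing only the upper triangle loses nothing.
print(np.round(np.triu(corr), 3))
```

The off-diagonal entries should agree, to rounding, with the values .945, .836, and .781 shown in Figure 6.4b.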

Some interactive statistics packages enable the user to employ brushing with scatter plot 
matrices. When a point in a scatter plot is brushed, it is given a distinctive appearance on the 
computer screen in each scatter plot in the matrix. The case corresponding to the brushed 
point may also be identified. Brushing is helpful to see whether a case that is outlying in 
one scatter plot is also outlying in some or all of the other plots. Brushing may also be 
applied to a group of points to see, for instance, whether a group of cases that does not fit 
the relationship for the remaining cases in one scatter plot also follows a distinct pattern in 
any of the other scatter plots. 


Three-Dimensional Scatter Plots 

Some interactive statistics packages provide three-dimensional scatter plots or point clouds, 
and permit spinning of these plots to enable the viewer to see the point cloud from different 
perspectives. This can be very helpful for identifying patterns that are only apparent from 
certain perspectives. Figure 6.6 on page 238 illustrates a three-dimensional scatter plot and
the use of spinning.

Residual Plots 

A plot of the residuals against the fitted values is useful for assessing the appropriateness of 
the multiple regression function and the constancy of the variance of the error terms, as well 
as for providing information about outliers, just as for simple linear regression. Similarly,





a plot of the residuals against time or against some other sequence can provide diagnostic 
information about possible correlations between the error terms in multiple regression. Box 
plots and normal probability plots of the residuals are useful for examining whether the 
error terms are reasonably normally distributed. 

In addition, residuals should be plotted against each of the predictor variables. Each of 
these plots can provide further information about the adequacy of the regression function 
with respect to that predictor variable (e.g., whether a curvature effect is required for that 
variable) and about possible variation in the magnitude of the error variance in relation to 
that predictor variable. 

Residuals should also be plotted against important predictor variables that were omitted 
from the model, to see if the omitted variables have substantial additional effects on the
response variable that have not yet been recognized in the regression model. Also, residuals
should be plotted against interaction terms for potential interaction effects not included in
the regression model, such as against $X_1X_2$, $X_1X_3$, and $X_2X_3$, to see whether some or all
of these interaction terms are required in the model. 

A plot of the absolute residuals or the squared residuals against the fitted values is useful 
for examining the constancy of the variance of the error terms. If nonconstancy is detected, a 
plot of the absolute residuals or the squared residuals against each of the predictor variables 
may identify one or several of the predictor variables to which the magnitude of the error 
variability is related. 

Correlation Test for Normality 

The correlation test for normality described in Chapter 3 carries forward directly to multiple 
regression. The expected values of the ordered residuals under normality are calculated 
according to (3.6), and the coefficient of correlation between the residuals and the expected 
values under normality is then obtained. Table B.6 is employed to assess whether or not
the magnitude of the correlation coefficient supports the reasonableness of the normality 
assumption. 

Brown-Forsythe Test for Constancy of Error Variance 

The Brown-Forsythe test statistic (3.9) for assessing the constancy of the error variance can 
be used readily in multiple regression when the error variance increases or decreases with 
one of the predictor variables. To conduct the Brown-Forsythe test, we divide the dataset 
into two groups, as for simple linear regression, where one group consists of cases where 
the level of the predictor variable is relatively low and the other group consists of cases 
where the level of the predictor variable is relatively high. The Brown-Forsythe test then 
proceeds as for simple linear regression. 
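
The grouping step can be sketched in code. The example below is only an illustration under stated assumptions: it fits the first-order model to the Dwaine Studios data of Figure 6.5b, splits the cases at the median of one predictor (the median split is our assumption; the text only asks for a "relatively low" and a "relatively high" group), and computes the two-sample t statistic on the absolute deviations of the residuals from their group medians. The function name `brown_forsythe` is hypothetical.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])

# Residuals from the first-order model with both predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b


def brown_forsythe(resid, x):
    """Two-group Brown-Forsythe t statistic; groups split at the median of x."""
    cut = np.median(x)
    low, high = resid[x <= cut], resid[x > cut]
    # Absolute deviations of the residuals around their own group median.
    d1 = np.abs(low - np.median(low))
    d2 = np.abs(high - np.median(high))
    n1, n2 = len(d1), len(d2)
    # Pooled variance of the absolute deviations.
    s2 = (np.sum((d1 - d1.mean()) ** 2) + np.sum((d2 - d2.mean()) ** 2)) / (n1 + n2 - 2)
    t_stat = (d1.mean() - d2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2


t_bf, df_bf = brown_forsythe(e, x1)
print(t_bf, df_bf)
```

The absolute value of `t_bf` would then be compared with the appropriate percentile of the t distribution with `df_bf` degrees of freedom, exactly as in simple linear regression.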

Breusch-Pagan Test for Constancy of Error Variance 

The Breusch-Pagan test (3.11) for constancy of the error variance in multiple regression is 
carried out exactly the same as for simple linear regression when the error variance increases 
or decreases with one of the predictor variables. The squared residuals are simply regressed 
against the predictor variable to obtain the regression sum of squares $SSR^*$, and the test
proceeds as before, using the error sum of squares SSE for the full multiple regression 
model. 





When the error variance is a function of more than one predictor variable, a multiple 
regression of the squared residuals against these predictor variables is conducted and the 
regression sum of squares SSR* is obtained. The test statistic again uses SSE for the full 
multiple regression model, but now the chi-square distribution involves $q$ degrees of freedom,
where $q$ is the number of predictor variables against which the squared residuals are
regressed. 
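
A minimal sketch of this calculation follows, assuming the statistic has the form $(SSR^*/2) \div (SSE/n)^2$ described for (3.11) in Chapter 3. The function name `breusch_pagan` and its arguments are hypothetical; the data are the Dwaine Studios observations from Figure 6.5b, and both predictors are used, so that $q = 2$.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])

X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b


def breusch_pagan(resid, Z):
    """Return (statistic, q) for squared residuals regressed on the columns of Z."""
    n = len(resid)
    e2 = resid ** 2
    Zc = np.column_stack([np.ones(n), Z])          # add the intercept column
    g, *_ = np.linalg.lstsq(Zc, e2, rcond=None)
    ssr_star = np.sum((Zc @ g - e2.mean()) ** 2)   # regression SS of the e^2 fit
    sse = np.sum(e2)                               # SSE of the full model
    stat = (ssr_star / 2) / (sse / n) ** 2
    return stat, Zc.shape[1] - 1


bp_stat, q = breusch_pagan(e, np.column_stack([x1, x2]))
print(bp_stat, q)
```

The statistic `bp_stat` would be referred to the chi-square distribution with `q` degrees of freedom.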


F Test for Lack of Fit

The lack of fit F test described in Chapter 3 for simple linear regression can be carried over
to test whether the multiple regression response function:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}$$

is an appropriate response surface. Repeat observations in multiple regression are replicate
observations on $Y$ corresponding to levels of each of the $X$ variables that are constant from
trial to trial. Thus, with two predictor variables, repeat observations require that $X_1$ and $X_2$
each remain at given levels from trial to trial.

Once the ANOVA table, shown in Table 6.1, has been obtained, SSE is decomposed into
pure error and lack of fit components. The pure error sum of squares SSPE is obtained by first
calculating for each replicate group the sum of squared deviations of the $Y$ observations
around the group mean, where a replicate group has the same values for each of the $X$
variables. Let $c$ denote the number of groups with distinct sets of levels for the $X$ variables,
and let the mean of the $Y$ observations for the $j$th group be denoted by $\bar{Y}_j$. Then the sum
of squares for the jth group is given by (3.17), and the pure error sum of squares is the sum 
of these sums of squares, as given by (3.16). The lack of fit sum of squares SSLF equals the 
difference SSE — SSPE, as indicated by (3.24). 

The number of degrees of freedom associated with SSPE is $n - c$, and the number of
degrees of freedom associated with SSLF is $(n - p) - (n - c) = c - p$. Thus, for testing
the alternatives: 


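$$
\begin{aligned}
H_0\colon\ & E\{Y\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1} \\
H_a\colon\ & E\{Y\} \ne \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}
\end{aligned}
\tag{6.68a}
$$

the appropriate test statistic is:

$$F^* = \frac{SSLF}{c - p} \div \frac{SSPE}{n - c} = \frac{MSLF}{MSPE} \tag{6.68b}$$

where SSLF and SSPE are given by (3.24) and (3.16), respectively, and the appropriate
decision rule is:

$$
\begin{aligned}
&\text{If } F^* \le F(1 - \alpha;\, c - p,\, n - c), \text{ conclude } H_0 \\
&\text{If } F^* > F(1 - \alpha;\, c - p,\, n - c), \text{ conclude } H_a
\end{aligned}
\tag{6.68c}
$$

Comment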


When replicate observations are not available, an approximate lack of fit test can be conducted
if there are cases that have similar $\mathbf{X}_h$ vectors. These cases are grouped together and treated as
pseudoreplicates, and the test for lack of fit is then carried out using these groupings of similar
cases.
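
The decomposition of SSE into pure error and lack of fit can be sketched as follows. The Dwaine Studios data contain no replicate groups, so the small data set below is entirely hypothetical and serves only to exercise the arithmetic; the function name `lack_of_fit_F` is likewise our own. Cases are grouped by identical rows of predictor values.

```python
import numpy as np
from collections import defaultdict

# Hypothetical data (not from the text): 4 distinct (X1, X2) settings, n = 9 trials.
X_raw = np.array([[1, 1], [1, 1], [1, 2], [1, 2], [2, 1], [2, 1],
                  [2, 2], [2, 2], [2, 2]], dtype=float)
y_rep = np.array([10.0, 11.0, 14.0, 15.0, 13.0, 12.0, 20.0, 21.0, 19.0])


def lack_of_fit_F(X_raw, y):
    """Return (F*, c - p, n - c, SSPE, SSLF) for a first-order model."""
    n = len(y)
    X = np.column_stack([np.ones(n), X_raw])
    p = X.shape[1]
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)

    # Group the Y observations by identical predictor rows (replicate groups).
    groups = defaultdict(list)
    for row, yi in zip(X_raw, y):
        groups[tuple(row)].append(yi)
    c = len(groups)

    # Pure error: squared deviations around each group mean, summed over groups.
    sspe = sum(np.sum((np.array(v) - np.mean(v)) ** 2) for v in groups.values())
    sslf = sse - sspe
    f_star = (sslf / (c - p)) / (sspe / (n - c))
    return f_star, c - p, n - c, sspe, sslf


f_star, df_lf, df_pe, sspe, sslf = lack_of_fit_F(X_raw, y_rep)
print(f_star, df_lf, df_pe)
```

Here $F^*$ would be compared with $F(1 - \alpha; c - p, n - c)$, as in decision rule (6.68c). With pseudoreplicates, the only change would be the rule used to form the groups.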





Remedial Measures 

The remedial measures described in Chapter 3 are also applicable to multiple regression. 
When a more complex model is required to recognize curvature or interaction effects, the 
multiple regression model can be expanded to include these effects. For example, $X_2^2$ might
be added as a variable to take into account a curvature effect of $X_2$, or $X_1X_3$ might be
added as a variable to recognize an interaction effect between $X_1$ and $X_3$ on the response
variable. Alternatively, transformations on the response and/or the predictor variables can 
be made, following the principles discussed in Chapter 3, to remedy model deficiencies. 
Transformations on the response variable $Y$ may be helpful when the distributions of the error
terms are quite skewed and the variance of the error terms is not constant. Transformations
of some of the predictor variables may be helpful when the effects of these variables are
curvilinear. In addition, transformations on Y and/or the predictor variables may be helpful 
in eliminating or substantially reducing interaction effects. 

As with simple linear regression, the usefulness of potential transformations needs to be 
examined by means of residual plots and other diagnostic tools to determine whether the 
multiple regression model for the transformed data is appropriate. 

Box-Cox Transformations. The Box-Cox procedure for determining an appropriate 
power transformation on Y for simple linear regression models described in Chapter 3 
is also applicable to multiple regression models. The standardized variable W in (3.36) is 
again obtained for different values of the parameter $\lambda$ and is now regressed against the set
of $X$ variables in the multiple regression model to find that value of $\lambda$ that minimizes the
error sum of squares SSE. 
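
The search over $\lambda$ can be sketched as a simple grid evaluation. The code below assumes the standardized variable of (3.36) takes the usual form $W = K_1(Y^\lambda - 1)$ for $\lambda \ne 0$ and $W = K_2 \ln Y$ for $\lambda = 0$, with $K_2$ the geometric mean of the $Y$ observations and $K_1 = 1/(\lambda K_2^{\lambda - 1})$. The grid of $\lambda$ values and the function name `box_cox_sse` are our own choices; the data are the Dwaine Studios observations from Figure 6.5b.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])


def box_cox_sse(y, X, lambdas):
    """SSE from regressing the standardized Box-Cox variable W on X, per lambda."""
    k2 = np.exp(np.mean(np.log(y)))                # geometric mean of Y
    sse = []
    for lam in lambdas:
        if abs(lam) < 1e-12:
            w = k2 * np.log(y)
        else:
            k1 = 1.0 / (lam * k2 ** (lam - 1))
            w = k1 * (y ** lam - 1)
        coef, *_ = np.linalg.lstsq(X, w, rcond=None)
        sse.append(np.sum((w - X @ coef) ** 2))
    return np.array(sse)


lambdas = np.linspace(-2, 2, 17)                   # -2, -1.75, ..., 2
sse_grid = box_cox_sse(y, X, lambdas)
best_lambda = lambdas[np.argmin(sse_grid)]
print(best_lambda)
```

Because $W$ is standardized, the SSE values are comparable across $\lambda$; at $\lambda = 1$ the value reduces to the SSE of the untransformed model.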

Box and Tidwell (Ref. 6.1) have also developed an iterative approach for ascertaining 
appropriate power transformations for each predictor variable in a multiple regression model 
when transformations on the predictor variables may be required.

6.9 An Example — Multiple Regression with Two 
Predictor Variables 

In this section, we shall develop a multiple regression application with two predictor variables.
We shall illustrate several diagnostic procedures and several types of inferences that
might be made for this application. We shall set up the necessary calculations in matrix
format but, for ease of viewing, show fewer significant digits for the elements of the matrices
than are used in the actual calculations.

Setting 

Dwaine Studios, Inc., operates portrait studios in 21 cities of medium size. These studios 
specialize in portraits of children. The company is considering an expansion into other 
cities of medium size and wishes to investigate whether sales ($Y$) in a community can be
predicted from the number of persons aged 16 or younger in the community ($X_1$) and the
per capita disposable personal income in the community ($X_2$). Data on these variables for
the most recent year for the 21 cities in which Dwaine Studios is now operating are shown 
in Figure 6.5b. Sales are expressed in thousands of dollars and are labeled Y or SALES; 
the number of persons aged 16 or younger is expressed in thousands of persons and is 






FIGURE 6.5 SYSTAT Multiple Regression Output and Basic Data: Dwaine Studios Example.

(a) Multiple Regression Output

DEP VAR: SALES   N: 21   MULTIPLE R: 0.957   SQUARED MULTIPLE R: 0.917
ADJUSTED SQUARED MULTIPLE R: .907   STANDARD ERROR OF ESTIMATE: 11.0074

VARIABLE    COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT       -68.8571     60.0170     0.0000           .  -1.1473      0.2663
TARGTPOP         1.4546      0.2118     0.7484      0.3896   6.8682      0.0000
DISPOINC         9.3655      4.0640     0.2511      0.3896   2.3045      0.0333

ANALYSIS OF VARIANCE

SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO        P
REGRESSION       24015.2821    2    12007.6411   99.1035   0.0000
RESIDUAL          2180.9274   18      121.1626

INVERSE (X'X)
            1          2         3
1     29.7289
2      0.0722    0.00037
3     -1.9926    -0.0056    0.1363

(b) Basic Data

CASE     X1     X2       Y    FITTED   RESIDUAL
   1   68.5   16.7   174.4   187.184   -12.7841
   2   45.2   16.8   164.4   154.229    10.1706
   3   91.3   18.2   244.2   234.396     9.8037
   4   47.8   16.3   154.6   153.329     1.2715
   5   46.9   17.3   181.6   161.385    20.2151
   6   66.1   18.2   207.5   197.741     9.7586
   7   49.5   15.9   152.8   152.055     0.7449
   8   52.0   17.2   163.2   167.867    -4.6666
   9   48.9   16.6   145.4   157.738   -12.3382
  10   38.4   16.0   137.2   136.846     0.3540
  11   87.9   18.3   241.9   230.387    11.5126
  12   72.8   17.1   191.1   197.185    -6.0849
  13   88.4   17.4   232.0   222.686     9.3143
  14   42.9   15.8   145.3   141.518     3.7816
  15   52.5   17.8   161.1   174.213   -13.1132
  16   85.7   18.4   209.7   228.124   -18.4239
  17   41.3   16.5   146.4   145.747     0.6530
  18   51.7   16.3   144.0   159.001   -15.0013
  19   89.6   18.1   232.6   230.987     1.6130
  20   82.7   19.1   224.1   230.316    -6.2160
  21   52.3   16.0   166.5   157.064     9.4356


labeled $X_1$ or TARGTPOP for target population; and per capita disposable personal income
is expressed in thousands of dollars and labeled $X_2$ or DISPOINC for disposable income.
The first-order regression model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{6.69}$$

with normal error terms is expected to be appropriate, on the basis of the SYGRAPH
scatter plot matrix in Figure 6.4a. Note the linear relation between target population and
sales and between disposable income and sales. Also note that there is more scatter in the 
latter relationship. Finally note that there is also some linear relationship between the two 
predictor variables. The correlation matrix in Figure 6.4b bears out these visual impressions 
from the scatter plot matrix. 

A SYGRAPH plot of the point cloud is shown in Figure 6.6a. By spinning the axes, we
obtain the perspective in Figure 6.6b which supports the tentative conclusion that a response 
plane may be a reasonable regression function to utilize here. 

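Basic Calculations

The X and Y matrices for the Dwaine Studios example are as follows:

$$
\mathbf{X} = \begin{bmatrix} 1 & 68.5 & 16.7 \\ 1 & 45.2 & 16.8 \\ \vdots & \vdots & \vdots \\ 1 & 52.3 & 16.0 \end{bmatrix}
\qquad
\mathbf{Y} = \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix}
\tag{6.70}
$$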






FIGURE 6.6 SYGRAPH Plot of Point Cloud before and after Spinning: Dwaine Studios Example.

(a) Before Spinning (b) After Spinning




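We require:

1. $\mathbf{X}'\mathbf{X}$:

$$
\mathbf{X}'\mathbf{X} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 68.5 & 45.2 & \cdots & 52.3 \\ 16.7 & 16.8 & \cdots & 16.0 \end{bmatrix}
\begin{bmatrix} 1 & 68.5 & 16.7 \\ 1 & 45.2 & 16.8 \\ \vdots & \vdots & \vdots \\ 1 & 52.3 & 16.0 \end{bmatrix}
$$

which yields:

$$
\mathbf{X}'\mathbf{X} =
\begin{bmatrix} 21.0 & 1,302.4 & 360.0 \\ 1,302.4 & 87,707.9 & 22,609.2 \\ 360.0 & 22,609.2 & 6,190.3 \end{bmatrix}
\tag{6.71}
$$

2. $\mathbf{X}'\mathbf{Y}$:

$$
\mathbf{X}'\mathbf{Y} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 68.5 & 45.2 & \cdots & 52.3 \\ 16.7 & 16.8 & \cdots & 16.0 \end{bmatrix}
\begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix}
$$

which yields:

$$
\mathbf{X}'\mathbf{Y} = \begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix}
\tag{6.72}
$$

3. $(\mathbf{X}'\mathbf{X})^{-1}$:

Using (5.23), we obtain:

$$
(\mathbf{X}'\mathbf{X})^{-1} =
\begin{bmatrix} 21.0 & 1,302.4 & 360.0 \\ 1,302.4 & 87,707.9 & 22,609.2 \\ 360.0 & 22,609.2 & 6,190.3 \end{bmatrix}^{-1}
=
\begin{bmatrix} 29.7289 & .0722 & -1.9926 \\ .0722 & .00037 & -.0056 \\ -1.9926 & -.0056 & .1363 \end{bmatrix}
\tag{6.73}
$$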
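
These three basic calculations can be reproduced with a few lines of matrix code. The sketch below assumes NumPy and uses the data of Figure 6.5b; `XtX`, `XtY`, and `XtX_inv` are our own names for the quantities in (6.71), (6.72), and (6.73).

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])

# The X matrix of (6.70): a column of 1s followed by the two predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])

XtX = X.T @ X                   # 3 x 3 matrix, compare with (6.71)
XtY = X.T @ y                   # 3-vector, compare with (6.72)
XtX_inv = np.linalg.inv(XtX)    # compare with (6.73)

print(np.round(XtX, 1))
print(np.round(XtY, 0))
print(np.round(XtX_inv, 4))
```

Explicit inversion is used here only to mirror the text; for larger problems a linear solver is usually preferred over forming the inverse.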


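Algebraic Equivalents. Note that $\mathbf{X}'\mathbf{X}$ for the first-order regression model (6.69) with
two predictor variables is:

$$
\mathbf{X}'\mathbf{X} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_{11} & X_{21} & \cdots & X_{n1} \\ X_{12} & X_{22} & \cdots & X_{n2} \end{bmatrix}
\begin{bmatrix} 1 & X_{11} & X_{12} \\ 1 & X_{21} & X_{22} \\ \vdots & \vdots & \vdots \\ 1 & X_{n1} & X_{n2} \end{bmatrix}
$$

or:

$$
\mathbf{X}'\mathbf{X} =
\begin{bmatrix}
n & \sum X_{i1} & \sum X_{i2} \\
\sum X_{i1} & \sum X_{i1}^2 & \sum X_{i1}X_{i2} \\
\sum X_{i2} & \sum X_{i2}X_{i1} & \sum X_{i2}^2
\end{bmatrix}
\tag{6.74}
$$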


For the Dwaine Studios example, we have:

$$
\begin{aligned}
n &= 21 \\
\sum X_{i1} &= 68.5 + 45.2 + \cdots = 1,302.4 \\
\sum X_{i1}X_{i2} &= 68.5(16.7) + 45.2(16.8) + \cdots = 22,609.2 \\
&\text{etc.}
\end{aligned}
$$


These elements are found in (6.71). 

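Also note that $\mathbf{X}'\mathbf{Y}$ for the first-order regression model (6.69) with two predictor
variables is:

$$
\mathbf{X}'\mathbf{Y} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_{11} & X_{21} & \cdots & X_{n1} \\ X_{12} & X_{22} & \cdots & X_{n2} \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} \sum Y_i \\ \sum X_{i1}Y_i \\ \sum X_{i2}Y_i \end{bmatrix}
\tag{6.75}
$$

For the Dwaine Studios example, we have:

$$
\begin{aligned}
\sum Y_i &= 174.4 + 164.4 + \cdots = 3,820 \\
\sum X_{i1}Y_i &= 68.5(174.4) + 45.2(164.4) + \cdots = 249,643 \\
\sum X_{i2}Y_i &= 16.7(174.4) + 16.8(164.4) + \cdots = 66,073
\end{aligned}
$$

These are the elements found in (6.72).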





Estimated Regression Function 

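The least squares estimates $\mathbf{b}$ are readily obtained by (6.25), using our basic calculations
in (6.72) and (6.73):

$$
\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} =
\begin{bmatrix} 29.7289 & .0722 & -1.9926 \\ .0722 & .00037 & -.0056 \\ -1.9926 & -.0056 & .1363 \end{bmatrix}
\begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix}
$$

which yields:

$$
\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} -68.857 \\ 1.455 \\ 9.366 \end{bmatrix}
\tag{6.76}
$$

and the estimated regression function is:

$$\hat{Y} = -68.857 + 1.455X_1 + 9.366X_2$$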

A three-dimensional plot of the estimated regression function, with the responses super¬ 
imposed, is shown in Figure 6.7. The residuals are represented by the small vertical lines 
connecting the responses to the estimated regression surface. 

This estimated regression function indicates that mean sales are expected to increase by 
1.455 thousand dollars when the target population increases by 1 thousand persons aged 
16 years or younger, holding per capita disposable personal income constant, and that mean 
sales are expected to increase by 9.366 thousand dollars when per capita income increases 
by 1 thousand dollars, holding the target population constant. 

Figure 6.5a contains SYSTAT multiple regression output for the Dwaine Studios example.
The estimated regression coefficients are shown in the column labeled COEFFICIENT;
the output shows one more decimal place than we have given in the text.

The SYSTAT output also contains the inverse of the $\mathbf{X}'\mathbf{X}$ matrix that we calculated
earlier; only the lower portion of the symmetric matrix is shown. The results are the same
as in (6.73).
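
For readers who wish to check the estimates, the following sketch solves the normal equations numerically. It assumes NumPy and the data of Figure 6.5b; it is one possible way of obtaining $\mathbf{b}$, not a description of how SYSTAT computes it.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve (X'X) b = X'Y directly rather than forming the inverse.
b = np.linalg.solve(X.T @ X, X.T @ y)
b0, b1, b2 = b
print(np.round(b, 4))

# Interpretation check: raising X1 by one unit (1 thousand persons) with X2
# held fixed changes the estimated mean response by b1.
base = b0 + b1 * 60.0 + b2 * 17.0
plus_one = b0 + b1 * 61.0 + b2 * 17.0
print(round(plus_one - base, 4))
```

The printed coefficients should agree with the COEFFICIENT column of Figure 6.5a to the decimals shown there.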


FIGURE 6.7 S-Plus Plot of Estimated Regression Surface: Dwaine Studios Example.






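Algebraic Version of Normal Equations. The normal equations in algebraic form for
the case of two predictor variables can be obtained readily from (6.74) and (6.75). We have:

$$(\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{Y}$$

$$
\begin{bmatrix}
n & \sum X_{i1} & \sum X_{i2} \\
\sum X_{i1} & \sum X_{i1}^2 & \sum X_{i1}X_{i2} \\
\sum X_{i2} & \sum X_{i2}X_{i1} & \sum X_{i2}^2
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}
=
\begin{bmatrix} \sum Y_i \\ \sum X_{i1}Y_i \\ \sum X_{i2}Y_i \end{bmatrix}
$$

from which we obtain the normal equations:

$$
\begin{aligned}
\sum Y_i &= nb_0 + b_1 \sum X_{i1} + b_2 \sum X_{i2} \\
\sum X_{i1}Y_i &= b_0 \sum X_{i1} + b_1 \sum X_{i1}^2 + b_2 \sum X_{i1}X_{i2} \\
\sum X_{i2}Y_i &= b_0 \sum X_{i2} + b_1 \sum X_{i1}X_{i2} + b_2 \sum X_{i2}^2
\end{aligned}
\tag{6.77}
$$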


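Fitted Values and Residuals

To examine the appropriateness of regression model (6.69) for the data at hand, we require
the fitted values $\hat{Y}_i$ and the residuals $e_i = Y_i - \hat{Y}_i$. We obtain by (6.28):

$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b}$$

$$
\begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_{21} \end{bmatrix}
=
\begin{bmatrix} 1 & 68.5 & 16.7 \\ 1 & 45.2 & 16.8 \\ \vdots & \vdots & \vdots \\ 1 & 52.3 & 16.0 \end{bmatrix}
\begin{bmatrix} -68.857 \\ 1.455 \\ 9.366 \end{bmatrix}
=
\begin{bmatrix} 187.2 \\ 154.2 \\ \vdots \\ 157.1 \end{bmatrix}
$$

Further, by (6.29) we find:

$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}$$

$$
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{21} \end{bmatrix}
=
\begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix}
-
\begin{bmatrix} 187.2 \\ 154.2 \\ \vdots \\ 157.1 \end{bmatrix}
=
\begin{bmatrix} -12.8 \\ 10.2 \\ \vdots \\ 9.4 \end{bmatrix}
$$

Figure 6.5b shows the computer output for the fitted values and residuals to more decimal
places than we have presented.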
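
The fitted values and residuals follow from two matrix operations, as the sketch below illustrates. It assumes NumPy and the data of Figure 6.5b; the names `fitted` and `resid` are ours.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)

fitted = X @ b        # Y-hat = Xb
resid = y - fitted    # e = Y - Y-hat

# Show the first two cases and the last case, as in the text.
for case in (0, 1, 20):
    print(case + 1, round(fitted[case], 3), round(resid[case], 4))

# With an intercept in the model, the residuals sum to zero (up to rounding).
print(abs(resid.sum()) < 1e-8)
```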


Analysis of Appropriateness of Model

We begin our analysis of the appropriateness of regression model (6.69) for the Dwaine
Studios example by considering the plot of the residuals $e$ against the fitted values $\hat{Y}$ in
Figure 6.8a. This plot does not suggest any systematic deviations from the response plane,





FIGURE 6.8 SYGRAPH Diagnostic Plots: Dwaine Studios Example.

(a) Residual Plot against $\hat{Y}$ (b) Residual Plot against $X_1$

(c) Residual Plot against $X_2$ (d) Residual Plot against $X_1X_2$


nor that the variance of the error terms varies with the level of $\hat{Y}$. Plots of the residuals $e$
against $X_1$ and $X_2$ in Figures 6.8b and 6.8c, respectively, are entirely consistent with the
conclusions of good fit by the response function and constant variance of the error terms.

In multiple regression applications, there is frequently the possibility of interaction effects
being present. To examine this for the Dwaine Studios example, we plotted the residuals
$e$ against the interaction term $X_1X_2$ in Figure 6.8d. A systematic pattern in this plot
would suggest that an interaction effect may be present, so that a response function of the
type:

$$E\{Y\} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2$$

might be more appropriate. Figure 6.8d does not exhibit any systematic pattern; hence, no
interaction effects reflected by the model term $\beta_3X_1X_2$ appear to be present.





FIGURE 6.9 Additional Diagnostic Plots: Dwaine Studios Example.

(a) Plot of Absolute Residuals against $\hat{Y}$ (b) Normal Probability Plot


Figure 6.9 contains two additional diagnostic plots. Figure 6.9a presents a plot of the
absolute residuals against the fitted values. There is no indication of nonconstancy of the
error variance. Figure 6.9b contains a normal probability plot of the residuals. The pattern
is moderately linear. The coefficient of correlation between the ordered residuals and their
expected values under normality is .980. This high value (the interpolated critical value in
Table B.6 for $n = 21$ and $\alpha = .05$ is .9525) helps to confirm the reasonableness of the
conclusion that the error terms are fairly normally distributed.
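
The correlation test for normality can be sketched numerically. The code below assumes that (3.6) gives the expected value of the $k$th smallest residual as $\sqrt{MSE}\, z[(k - .375)/(n + .25)]$, and it uses the standard library's `NormalDist` for the normal percentiles. The critical value .9525 is copied from the text rather than computed.

```python
import numpy as np
from statistics import NormalDist

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

n, p = X.shape
mse = e @ e / (n - p)

# Expected values of the ordered residuals under normality, k = 1, ..., n.
k = np.arange(1, n + 1)
z = np.array([NormalDist().inv_cdf((ki - 0.375) / (n + 0.25)) for ki in k])
expected = np.sqrt(mse) * z

# Correlation between the ordered residuals and their expected values.
r = np.corrcoef(np.sort(e), expected)[0, 1]
critical = 0.9525   # interpolated Table B.6 value quoted in the text
print(round(r, 3), r > critical)
```

A value of `r` above the critical value supports the reasonableness of the normality assumption.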

Since the Dwaine Studios data are cross-sectional and do not involve a time sequence, 
a time sequence plot is not relevant here. Thus, all of the diagnostics support the use of 
regression model (6.69) for the Dwaine Studios example. 

Analysis of Variance 

To test whether sales are related to target population and per capita disposable income, we 
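require the ANOVA table. The basic quantities needed are:

$$
\mathbf{Y}'\mathbf{Y} =
\begin{bmatrix} 174.4 & 164.4 & \cdots & 166.5 \end{bmatrix}
\begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix}
= 721,072.40
$$

$$
\left(\frac{1}{n}\right)\mathbf{Y}'\mathbf{J}\mathbf{Y} =
\frac{1}{21}
\begin{bmatrix} 174.4 & 164.4 & \cdots & 166.5 \end{bmatrix}
\begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix}
\begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix}
= \frac{(3,820.0)^2}{21} = 694,876.19
$$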







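Thus:

$$SSTO = \mathbf{Y}'\mathbf{Y} - \left(\frac{1}{n}\right)\mathbf{Y}'\mathbf{J}\mathbf{Y} = 721,072.40 - 694,876.19 = 26,196.21$$

and, from our results in (6.72) and (6.76):

$$
\begin{aligned}
SSE &= \mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y} \\
&= 721,072.40 - \begin{bmatrix} -68.857 & 1.455 & 9.366 \end{bmatrix}
\begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix} \\
&= 721,072.40 - 718,891.47 = 2,180.93
\end{aligned}
$$

Finally, we obtain by subtraction:

$$SSR = SSTO - SSE = 26,196.21 - 2,180.93 = 24,015.28$$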

These sums of squares are shown in the SYSTAT ANOVA table in Figure 6.5a. Also 
shown in the ANOVA table are degrees of freedom and mean squares. Note that three 
regression parameters had to be estimated; hence, 21—3=18 degrees of freedom are 
associated with SSE. Also, the number of degrees of freedom associated with SSR is 
2—the number of X variables in the model. 

Test of Regression Relation. To test whether sales are related to target population and 
per capita disposable income： 

$$
\begin{aligned}
H_0\colon\ & \beta_1 = 0 \text{ and } \beta_2 = 0 \\
H_a\colon\ & \text{not both } \beta_1 \text{ and } \beta_2 \text{ equal zero}
\end{aligned}
$$

we use test statistic (6.39b):

$$F^* = \frac{MSR}{MSE} = \frac{12,007.64}{121.1626} = 99.1$$

This test statistic is labeled F-RATIO in the SYSTAT output. For $\alpha = .05$, we require
$F(.95; 2, 18) = 3.55$. Since $F^* = 99.1 > 3.55$, we conclude $H_a$, that sales are related to
target population and per capita disposable income. The P-value for this test is .0000, as 
shown in the SYSTAT output labeled P. 

Whether the regression relation is useful for making predictions of sales or estimates of 
mean sales still remains to be seen. 

Coefficient of Multiple Determination. For our example, we have by (6.40): 


$$R^2 = \frac{SSR}{SSTO} = \frac{24,015.28}{26,196.21} = .917$$

Thus, when the two predictor variables, target population and per capita disposable income,
are considered, the variation in sales is reduced by 91.7 percent. The coefficient of multiple
determination is shown in the SYSTAT output labeled SQUARED MULTIPLE R. Also
shown in the output is the coefficient of multiple correlation $R = .957$ and the adjusted
coefficient of multiple determination (6.42), $R_a^2 = .907$, which is labeled in the output





ADJUSTED SQUARED MULTIPLE R. Note that adjusting for the number of predictor 
variables in the model had only a small effect here on $R^2$.
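
The analysis of variance quantities, the F statistic, and the two coefficients of determination can be recomputed together. The sketch below assumes NumPy and the data of Figure 6.5b; the critical value 3.55 is taken from the text rather than computed from the F distribution.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
n, p = X.shape

# Sums of squares in the matrix forms used in the text.
ssto = y @ y - y.sum() ** 2 / n      # Y'Y - (1/n) Y'JY
sse = y @ y - b @ (X.T @ y)          # Y'Y - b'X'Y
ssr = ssto - sse

msr = ssr / (p - 1)                  # 2 degrees of freedom
mse = sse / (n - p)                  # 18 degrees of freedom
f_star = msr / mse

r2 = ssr / ssto
r2_adj = 1 - (n - 1) / (n - p) * sse / ssto

f_crit = 3.55                        # F(.95; 2, 18), quoted from the text
print(round(f_star, 1), f_star > f_crit, round(r2, 3), round(r2_adj, 3))
```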

Estimation of Regression Parameters 

Dwaine Studios is not interested in the parameter $\beta_0$ since it falls far outside the scope of
the model. It is desired to estimate $\beta_1$ and $\beta_2$ jointly with family confidence coefficient .90.
We shall use the simultaneous Bonferroni confidence limits (6.52).

First, we need the estimated variance-covariance matrix $\mathbf{s}^2\{\mathbf{b}\}$:

$$\mathbf{s}^2\{\mathbf{b}\} = MSE(\mathbf{X}'\mathbf{X})^{-1}$$

MSE is given in Figure 6.5a, and $(\mathbf{X}'\mathbf{X})^{-1}$ was obtained in (6.73). Hence:

$$
\mathbf{s}^2\{\mathbf{b}\} = 121.1626
\begin{bmatrix} 29.7289 & .0722 & -1.9926 \\ .0722 & .00037 & -.0056 \\ -1.9926 & -.0056 & .1363 \end{bmatrix}
=
\begin{bmatrix} 3,602.0 & 8.748 & -241.43 \\ 8.748 & .0448 & -.679 \\ -241.43 & -.679 & 16.514 \end{bmatrix}
\tag{6.78}
$$

The two estimated variances we require are:

$$
\begin{aligned}
s^2\{b_1\} &= .0448 \quad \text{or} \quad s\{b_1\} = .212 \\
s^2\{b_2\} &= 16.514 \quad \text{or} \quad s\{b_2\} = 4.06
\end{aligned}
$$

These estimated standard deviations are shown in the SYSTAT output in Figure 6.5a, labeled
STD ERROR, to four decimal places.

Next, we require for $g = 2$ simultaneous estimates:

$$B = t[1 - .10/2(2); 18] = t(.975; 18) = 2.101$$

The two pairs of simultaneous confidence limits therefore are $1.455 \pm 2.101(.212)$ and
$9.366 \pm 2.101(4.06)$, which yield the confidence intervals:

$$
\begin{aligned}
1.01 &\le \beta_1 \le 1.90 \\
.84 &\le \beta_2 \le 17.9
\end{aligned}
$$

With family confidence coefficient .90, we conclude that $\beta_1$ falls between 1.01 and 1.90
and that $\beta_2$ falls between .84 and 17.9.

Note that the simultaneous confidence intervals suggest that both $\beta_1$ and $\beta_2$ are positive,
which is in accord with theoretical expectations that sales should increase with higher target
population and higher per capita disposable income, the other variable being held constant.
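
The Bonferroni limits can be checked with the sketch below. It assumes NumPy and the data of Figure 6.5b, and it takes the multiple $B = 2.101$ directly from the text instead of computing the t percentile. Because the code carries full precision throughout, its limits may differ in the last reported digit from those obtained with the rounded values above.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
n, p = X.shape
mse = np.sum((y - X @ b) ** 2) / (n - p)

# Estimated variance-covariance matrix of b, as in (6.78).
s2_b = mse * np.linalg.inv(X.T @ X)
se_b = np.sqrt(np.diag(s2_b))

B = 2.101   # t(.975; 18) for g = 2 and family confidence .90, from the text
ci_b1 = (b[1] - B * se_b[1], b[1] + B * se_b[1])
ci_b2 = (b[2] - B * se_b[2], b[2] + B * se_b[2])
print(np.round(ci_b1, 2), np.round(ci_b2, 2))
```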

Estimation of Mean Response

Dwaine Studios would like to estimate expected (mean) sales in cities with target population
$X_{h1} = 65.4$ thousand persons aged 16 years or younger and per capita disposable income
$X_{h2} = 17.6$ thousand dollars with a 95 percent confidence interval. We define:


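$$\mathbf{X}_h = \begin{bmatrix} 1 \\ 65.4 \\ 17.6 \end{bmatrix}$$

The point estimate of mean sales is by (6.55):

$$
\hat{Y}_h = \mathbf{X}_h'\mathbf{b} =
\begin{bmatrix} 1 & 65.4 & 17.6 \end{bmatrix}
\begin{bmatrix} -68.857 \\ 1.455 \\ 9.366 \end{bmatrix}
= 191.10
$$

The estimated variance by (6.58), using the results in (6.78), is:

$$
s^2\{\hat{Y}_h\} = \mathbf{X}_h'\mathbf{s}^2\{\mathbf{b}\}\mathbf{X}_h =
\begin{bmatrix} 1 & 65.4 & 17.6 \end{bmatrix}
\begin{bmatrix} 3,602.0 & 8.748 & -241.43 \\ 8.748 & .0448 & -.679 \\ -241.43 & -.679 & 16.514 \end{bmatrix}
\begin{bmatrix} 1 \\ 65.4 \\ 17.6 \end{bmatrix}
= 7.656
$$

or:

$$s\{\hat{Y}_h\} = 2.77$$

For confidence coefficient .95, we need $t(.975; 18) = 2.101$, and we obtain by (6.59)
the confidence limits $191.10 \pm 2.101(2.77)$. The confidence interval for $E\{Y_h\}$ therefore
is:

$$185.3 \le E\{Y_h\} \le 196.9$$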

Thus, with confidence coefficient .95, we estimate that mean sales in cities with target 
population of 65.4 thousand persons aged 16 years or younger and per capita disposable 
income of 17.6 thousand dollars are somewhere between 185.3 and 196.9 thousand dollars. 
Dwaine Studios considers this confidence interval to provide information about expected 
(average) sales in communities of this size and income level that is precise enough for 
planning purposes. 
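
The interval estimate for the mean response can be reproduced as follows. The sketch assumes NumPy and the data of Figure 6.5b, and takes $t(.975; 18) = 2.101$ from the text. Full-precision arithmetic is used, since the quadratic form $\mathbf{X}_h'\mathbf{s}^2\{\mathbf{b}\}\mathbf{X}_h$ involves large terms of opposite sign and is sensitive to rounding of the matrix elements.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
n, p = X.shape
mse = np.sum((y - X @ b) ** 2) / (n - p)
s2_b = mse * np.linalg.inv(X.T @ X)

x_h = np.array([1.0, 65.4, 17.6])   # leading 1 multiplies the intercept
y_hat_h = x_h @ b                    # point estimate of mean sales
s2_y_hat = x_h @ s2_b @ x_h          # estimated variance of Y-hat_h
s_y_hat = np.sqrt(s2_y_hat)

t_crit = 2.101                       # t(.975; 18), quoted from the text
ci_mean = (y_hat_h - t_crit * s_y_hat, y_hat_h + t_crit * s_y_hat)
print(round(y_hat_h, 2), round(s2_y_hat, 3), np.round(ci_mean, 1))
```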

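Algebraic Version of Estimated Variance $s^2\{\hat{Y}_h\}$. Since by (6.58):

$$s^2\{\hat{Y}_h\} = \mathbf{X}_h'\mathbf{s}^2\{\mathbf{b}\}\mathbf{X}_h$$

it follows for the case of two predictor variables in a first-order model:

$$
\begin{aligned}
s^2\{\hat{Y}_h\} = {}& s^2\{b_0\} + X_{h1}^2 s^2\{b_1\} + X_{h2}^2 s^2\{b_2\} + 2X_{h1}s\{b_0, b_1\} \\
&+ 2X_{h2}s\{b_0, b_2\} + 2X_{h1}X_{h2}s\{b_1, b_2\}
\end{aligned}
\tag{6.79}
$$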





Prediction Limits for New Observations

Dwaine Studios as part of a possible expansion program would like to predict sales for two
new cities, with the following characteristics:

            City A   City B
$X_{h1}$     65.4     53.1
$X_{h2}$     17.6     17.7

Prediction intervals with a 90 percent family confidence coefficient are desired. Note that
the two new cities have characteristics that fall well within the pattern of the cities on
which the regression analysis is based.

To determine which simultaneous prediction intervals are best here, we find $S$ as given
in (6.65a) and $B$ as given in (6.66a) for $g = 2$ and $1 - \alpha = .90$:

$$S^2 = 2F(.90; 2, 18) = 2(2.62) = 5.24 \qquad S = 2.29$$

and:

$$B = t[1 - .10/2(2); 18] = t(.975; 18) = 2.101$$

Hence, the Bonferroni limits are more efficient here.

For city A, we use the results obtained when estimating mean sales, since the levels of
the predictor variables are the same here. We have from before:

$$\hat{Y}_h = 191.10 \qquad s^2\{\hat{Y}_h\} = 7.656 \qquad MSE = 121.1626$$

Hence, by (6.63a):

$$s^2\{\text{pred}\} = MSE + s^2\{\hat{Y}_h\} = 121.1626 + 7.656 = 128.82$$

or:

$$s\{\text{pred}\} = 11.35$$

In similar fashion, we obtain for city B (calculations not shown):

$$\hat{Y}_h = 174.15 \qquad s\{\text{pred}\} = 11.93$$

We previously found that the Bonferroni multiple is $B = 2.101$. Hence, by (6.66) the
simultaneous Bonferroni prediction limits with family confidence coefficient .90 are $191.10 \pm
2.101(11.35)$ and $174.15 \pm 2.101(11.93)$, leading to the simultaneous prediction intervals:

$$
\begin{aligned}
\text{City A:}\quad & 167.3 \le Y_{h(\text{new})} \le 214.9 \\
\text{City B:}\quad & 149.1 \le Y_{h(\text{new})} \le 199.2
\end{aligned}
$$

With family confidence coefficient .90, we predict that sales in the two cities will be within
the indicated limits. Dwaine Studios considers these prediction limits to be somewhat useful
for planning purposes, but would prefer tighter intervals for predicting sales for a particular
city. A consulting firm has been engaged to see if additional or alternative predictor variables 
can be found that will lead to tighter prediction intervals. 
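
The simultaneous prediction limits for the two cities can be sketched as below. The code assumes NumPy and the data of Figure 6.5b; the percentiles $t(.975; 18) = 2.101$ and $F(.90; 2, 18) = 2.62$ are copied from the text, and the dictionary `cities` is simply our way of holding the two sets of predictor levels.

```python
import numpy as np

# Dwaine Studios data (Figure 6.5b).
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
n, p = X.shape
mse = np.sum((y - X @ b) ** 2) / (n - p)
s2_b = mse * np.linalg.inv(X.T @ X)

g = 2
B = 2.101                     # t(.975; 18), from the text
S = np.sqrt(g * 2.62)         # Scheffe multiple, with F(.90; 2, 18) = 2.62
multiple = min(B, S)          # use whichever gives the tighter limits

cities = {"A": [1.0, 65.4, 17.6], "B": [1.0, 53.1, 17.7]}
s_pred = {}
intervals = {}
for name, levels in cities.items():
    x_h = np.array(levels)
    y_hat = x_h @ b
    # s^2{pred} = MSE + s^2{Y-hat_h}
    s_pred[name] = np.sqrt(mse + x_h @ s2_b @ x_h)
    half_width = multiple * s_pred[name]
    intervals[name] = (y_hat - half_width, y_hat + half_width)
    print(name, round(y_hat, 2), round(s_pred[name], 2), np.round(intervals[name], 1))
```

Note how little of $s^2\{\text{pred}\}$ comes from $s^2\{\hat{Y}_h\}$: most of the width of these intervals is due to MSE, which is why additional predictor variables, rather than more cities alone, would be needed to tighten them substantially.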





Note incidentally that even though the coefficient of multiple determination, $R^2 = .917$,
is high, the prediction limits here are not fully satisfactory. This serves as another reminder
that a high value of $R^2$ does not necessarily indicate that precise predictions can be made.


Cited Reference

6.1. Box, G. E. P., and P. W. Tidwell. "Transformations of the Independent Variables," Technometrics 4 (1962), pp. 531-50.


Problems 


6.1. Set up the $\mathbf{X}$ matrix and $\boldsymbol{\beta}$ vector for each of the following regression models (assume i = 1, ..., 4): 

a. $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1} X_{i2} + \varepsilon_i$

b. $\log Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$

6.2. Set up the $\mathbf{X}$ matrix and $\boldsymbol{\beta}$ vector for each of the following regression models (assume i = 1, ..., 5): 

a. $Y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}^2 + \varepsilon_i$

b. $\sqrt{Y_i} = \beta_0 + \beta_1 X_{i1} + \beta_2 \log_{10} X_{i2} + \varepsilon_i$

6.3. A student stated: “Adding predictor variables to a regression model can never reduce $R^2$, so we 
should include all available predictor variables in the model.” Comment. 

6.4. Why is it not meaningful to attach a sign to the coefficient of multiple correlation R, although 
we do so for the coefficient of simple correlation $r_{12}$? 

6.5. Brand preference. In a small-scale experimental study of the relation between degree of brand 
liking (Y) and moisture content ($X_1$) and sweetness ($X_2$) of the product, the following results 
were obtained from the experiment based on a completely randomized design (data are coded): 


i:          1     2     3   ...    14    15    16
X_i1:       4     4     4   ...    10    10    10
X_i2:       2     4     2   ...     4     2     4
Y_i:       64    73    61   ...    95    94   100


a. Obtain the scatter plot matrix and the correlation matrix. What information do these 
diagnostic aids provide here? 

b. Fit regression model (6.1) to the data. State the estimated regression function. How is $b_1$ 
interpreted here? 

c. Obtain the residuals and prepare a box plot of the residuals. What information does this plot 
provide? 

d. Plot the residuals against $\hat{Y}$, $X_1$, $X_2$, and $X_1 X_2$ on separate graphs. Also prepare a normal 
probability plot. Interpret the plots and summarize your findings. 

e. Conduct the Breusch-Pagan test for constancy of the error variance, assuming 
$\log \sigma_i^2 = \gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i2}$; use $\alpha = .01$. State the alternatives, decision rule, and conclusion. 

f. Conduct a formal test for lack of fit of the first-order regression function; use $\alpha = .01$. State 
the alternatives, decision rule, and conclusion. 


6.6. Refer to Brand preference Problem 6.5. Assume that regression model (6.1) with independent 
normal error terms is appropriate. 

a. Test whether there is a regression relation, using $\alpha = .01$. State the alternatives, decision 
rule, and conclusion. What does your test imply about $\beta_1$ and $\beta_2$? 



Chapter 6 Multiple Regression I 249 


b. What is the P-value of the test in part (a)? 

c. Estimate $\beta_1$ and $\beta_2$ jointly by the Bonferroni procedure, using a 99 percent family confidence 
coefficient. Interpret your results. 

6.7. Refer to Brand preference Problem 6.5. 

a. Calculate the coefficient of multiple determination $R^2$. How is it interpreted here? 

b. Calculate the coefficient of simple determination $R^2$ between $Y_i$ and $\hat{Y}_i$. Does it equal the 
coefficient of multiple determination in part (a)? 

6.8. Refer to Brand preference Problem 6.5. Assume that regression model (6.1) with independent 
normal error terms is appropriate. 

a. Obtain an interval estimate of $E\{Y_h\}$ when $X_{h1} = 5$ and $X_{h2} = 4$. Use a 99 percent confidence 
coefficient. Interpret your interval estimate. 

b. Obtain a prediction interval for a new observation $Y_{h(\text{new})}$ when $X_{h1} = 5$ and $X_{h2} = 4$. Use a 
99 percent confidence coefficient. 

*6.9. Grocery retailer. A large, national grocery retailer tracks productivity and costs of its facilities 
closely. Data below were obtained from a single distribution center for a one-year period. Each 
data point for each variable represents one week of activity. The variables included are the 
number of cases shipped ($X_1$), the indirect costs of the total labor hours as a percentage ($X_2$), 
a qualitative predictor called holiday that is coded 1 if the week has a holiday and 0 otherwise 
($X_3$), and the total labor hours (Y). 


i:            1         2         3   ...        50        51        52
X_i1:   305,657   328,476   317,164   ...   290,455   411,750   292,087
X_i2:      7.17      6.20      4.61   ...      7.99      7.83      7.77
X_i3:         0         0         0   ...         0         0         0
Y_i:       4264      4496      4317   ...      4499      4186      4342


a. Prepare separate stem-and-leaf plots for the number of cases shipped $X_{i1}$ and the indirect 
cost of the total hours $X_{i2}$. Are there any outlying cases present? Are there any gaps in the 
data? 

b. The cases are given in consecutive weeks. Prepare a time plot for each predictor variable. 
What do the plots show? 

c. Obtain the scatter plot matrix and the correlation matrix. What information do these 
diagnostic aids provide here? 

*6.10. Refer to Grocery retailer Problem 6.9. 

a. Fit regression model (6.5) to the data for three predictor variables. State the estimated 
regression function. How are $b_1$, $b_2$, and $b_3$ interpreted here? 

b. Obtain the residuals and prepare a box plot of the residuals. What information does this plot 
provide? 

c. Plot the residuals against $\hat{Y}$, $X_1$, $X_2$, $X_3$, and $X_1 X_2$ on separate graphs. Also prepare a normal 
probability plot. Interpret the plots and summarize your findings. 

d. Prepare a time plot of the residuals. Is there any indication that the error terms are correlated? 
Discuss. 

e. Divide the 52 cases into two groups, placing the 26 cases with the smallest fitted values 
$\hat{Y}_i$ into group 1 and the other 26 cases into group 2. Conduct the Brown-Forsythe test for 
constancy of the error variance, using $\alpha = .01$. State the decision rule and conclusion. 






*6.11. Refer to Grocery retailer Problem 6.9. Assume that regression model (6.5) for three predictor 
variables with independent normal error terms is appropriate. 

a. Test whether there is a regression relation, using level of significance .05. State the 
alternatives, decision rule, and conclusion. What does your test result imply about $\beta_1$, $\beta_2$, and $\beta_3$? 
What is the P-value of the test? 

b. Estimate $\beta_1$ and $\beta_3$ jointly by the Bonferroni procedure, using a 95 percent family confidence 
coefficient. Interpret your results. 

c. Calculate the coefficient of multiple determination $R^2$. How is this measure interpreted here? 

*6.12. Refer to Grocery retailer Problem 6.9. Assume that regression model (6.5) for three predictor 
variables with independent normal error terms is appropriate. 

a. Management desires simultaneous interval estimates of the total labor hours for the following 
five typical weekly shipments: 

              1         2         3         4         5
X_1:    302,000   245,000   280,000   350,000   295,000
X_2:       7.20      7.40      6.90      7.00      6.70
X_3:          0         0         0         0

Obtain the family of estimates using a 95 percent family confidence coefficient. Employ the 
Working-Hotelling or the Bonferroni procedure, whichever is more efficient. 

b. For the data in Problem 6.9 on which the regression fit is based, would you consider a 
shipment of 400,000 cases with an indirect percentage of 7.20 on a nonholiday week to be 
within the scope of the model? What about a shipment of 400,000 cases with an indirect 
percentage of 9.9 on a nonholiday week? Support your answers by preparing a relevant plot. 

*6.13. Refer to Grocery retailer Problem 6.9. Assume that regression model (6.5) for three predictor 
variables with independent normal error terms is appropriate. Four separate shipments with the 
following characteristics must be processed next month: 

              1         2         3         4
X_1:    230,000   250,000   280,000   340,000
X_2:       7.50      7.30      7.10      6.90
X_3:          0         0         0         0

Management desires predictions of the handling times for these shipments so that the actual 
handling times can be compared with the predicted times to determine whether any are out of 
line. Develop the needed predictions, using the most efficient approach and a family confidence 
coefficient of 95 percent. 

*6.14. Refer to Grocery retailer Problem 6.9. Assume that regression model (6.5) for three predictor 
variables with independent normal error terms is appropriate. Three new shipments are to be 
received, each with $X_{h1} = 282{,}000$, $X_{h2} = 7.10$, and $X_{h3} = 0$. 

a. Obtain a 95 percent prediction interval for the mean handling time for these shipments. 

b. Convert the interval obtained in part (a) into a 95 percent prediction interval for the total 
labor hours for the three shipments. 

*6.15. Patient satisfaction. A hospital administrator wished to study the relation between patient 
satisfaction (Y) and patient's age ($X_1$, in years), severity of illness ($X_2$, an index), and anxiety 


level ($X_3$, an index). The administrator randomly selected 46 patients and collected the data 
presented below, where larger values of Y, $X_2$, and $X_3$ are, respectively, associated with more 
satisfaction, increased severity of illness, and more anxiety. 


i:          1     2     3   ...    44    45    46
X_i1:      50    36    40   ...    45    37    28
X_i2:      51    46    48   ...    51    53    46
X_i3:     2.3   2.3   2.2   ...   2.2   2.1   1.8
Y_i:       48    57    66   ...    68    59    92


a. Prepare a stem-and-leaf plot for each of the predictor variables. Are any noteworthy features 
revealed by these plots? 

b. Obtain the scatter plot matrix and the correlation matrix. Interpret these and state your 
principal findings. 

c. Fit regression model (6.5) for three predictor variables to the data and state the estimated 
regression function. How is $b_2$ interpreted here? 

d. Obtain the residuals and prepare a box plot of the residuals. Do there appear to be any 
outliers? 

e. Plot the residuals against $\hat{Y}$, each of the predictor variables, and each two-factor interaction 
term on separate graphs. Also prepare a normal probability plot. Interpret your plots and 
summarize your findings. 

f. Can you conduct a formal test for lack of fit here? 

g. Conduct the Breusch-Pagan test for constancy of the error variance, assuming 
$\log \sigma_i^2 = \gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \gamma_3 X_{i3}$; use $\alpha = .01$. State the alternatives, decision rule, and 
conclusion. 

*6.16. Refer to Patient satisfaction Problem 6.15. Assume that regression model (6.5) for three 
predictor variables with independent normal error terms is appropriate. 

a. Test whether there is a regression relation; use $\alpha = .10$. State the alternatives, decision rule, 
and conclusion. What does your test imply about $\beta_1$, $\beta_2$, and $\beta_3$? What is the P-value of the 
test? 

b. Obtain joint interval estimates of $\beta_1$, $\beta_2$, and $\beta_3$, using a 90 percent family confidence 
coefficient. Interpret your results. 

c. Calculate the coefficient of multiple determination. What does it indicate here? 

*6.17. Refer to Patient satisfaction Problem 6.15. Assume that regression model (6.5) for three 
predictor variables with independent normal error terms is appropriate. 

a. Obtain an interval estimate of the mean satisfaction when $X_{h1} = 35$, $X_{h2} = 45$, and $X_{h3} = 2.2$. 
Use a 90 percent confidence coefficient. Interpret your confidence interval. 

b. Obtain a prediction interval for a new patient’s satisfaction when $X_{h1} = 35$, $X_{h2} = 45$, and 
$X_{h3} = 2.2$. Use a 90 percent confidence coefficient. Interpret your prediction interval. 

6.18. Commercial properties. A commercial real estate company evaluates vacancy rates, square 
footage, rental rates, and operating expenses for commercial properties in a large metropolitan 
area in order to provide clients with quantitative information upon which to make rental 
decisions. The data below are taken from 81 suburban commercial properties that are the newest, 
best located, most attractive, and expensive for five specific geographic areas. Shown here are 
the age ($X_1$), operating expenses and taxes ($X_2$), vacancy rates ($X_3$), total square footage ($X_4$), 
and rental rates (Y). 


i:            1         2         3   ...        79        80        81
X_i1:                  14        16   ...        15        11        14
X_i2:      5.02      8.19      3.00   ...     11.97     11.27     12.68
X_i3:      0.14      0.27         0   ...      0.14      0.03      0.03
X_i4:   123,000   104,079    39,998   ...   254,700   434,746   201,930
Y_i:      13.50     12.00     10.50   ...     15.00     15.25     14.50


a. Prepare a stem-and-leaf plot for each predictor variable. What information do these plots 
provide? 

b. Obtain the scatter plot matrix and the correlation matrix. Interpret these and state your 
principal findings. 

c. Fit regression model (6.5) for four predictor variables to the data. State the estimated 
regression function. 

d. Obtain the residuals and prepare a box plot of the residuals. Does the distribution appear to 
be fairly symmetrical? 

e. Plot the residuals against $\hat{Y}$, each predictor variable, and each two-factor interaction term on 
separate graphs. Also prepare a normal probability plot. Analyze your plots and summarize 
your findings. 

f. Can you conduct a formal test for lack of fit here? 

g. Divide the 81 cases into two groups, placing the 40 cases with the smallest fitted values $\hat{Y}_i$ 
into group 1 and the remaining cases into group 2. Conduct the Brown-Forsythe test for 
constancy of the error variance, using $\alpha = .05$. State the decision rule and conclusion. 

6.19. Refer to Commercial properties Problem 6.18. Assume that regression model (6.5) for four 
predictor variables with independent normal error terms is appropriate. 

a. Test whether there is a regression relation; use $\alpha = .05$. State the alternatives, decision rule, 
and conclusion. What does your test imply about $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$? What is the P-value 
of the test? 

b. Estimate $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ jointly by the Bonferroni procedure, using a 95 percent family 
confidence coefficient. Interpret your results. 

c. Calculate $R^2$ and interpret this measure. 

6.20. Refer to Commercial properties Problem 6.18. Assume that regression model (6.5) for four 
predictor variables with independent normal error terms is appropriate. The researcher wishes 
to obtain simultaneous interval estimates of the mean rental rates for four typical properties 
specified as follows: 

              1         2         3         4
X_1:        5.0       6.0      14.0      12.0
X_2:       8.25      8.50     11.50     10.25
X_3:          0      0.23      0.11         0
X_4:    250,000   270,000   300,000   310,000

Obtain the family of estimates using a 95 percent family confidence coefficient. Employ the 
most efficient procedure. 








6.21. Refer to Commercial properties Problem 6.18. Assume that regression model (6.5) for four 
predictor variables with independent normal error terms is appropriate. Three properties with 
the following characteristics did not have any rental information available. 

              1         2         3
X_1:        4.0       6.0      12.0
X_2:       10.0      11.5      12.5
X_3:       0.10         0      0.32
X_4:     80,000   120,000   340,000

Develop separate prediction intervals for the rental rates of these properties, using a 95 percent 
statement confidence coefficient in each case. Can the rental rates of these three properties 
be predicted fairly precisely? What is the family confidence level for the set of three 
predictions? 


6.22. For each of the following regression models, indicate whether it is a general linear 
regression model. If it is not, state whether it can be expressed in the form of (6.7) by a suitable 
transformation: 

a. $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 \log_{10} X_{i2} + \beta_3 X_{i1}^2 + \varepsilon_i$

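b. $Y_i = \varepsilon_i \exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}^2)$

c. $Y_i = \log_{10}(\beta_1 X_{i1}) + \beta_2 X_{i2} + \varepsilon_i$

d. $Y_i = \beta_0 \exp(\beta_1 X_{i1}) + \varepsilon_i$

e. $Y_i = [1 + \exp(\beta_0 + \beta_1 X_{i1} + \varepsilon_i)]^{-1}$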

6.23. (Calculus needed.) Consider the multiple regression model: 

$$Y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \qquad i = 1, \ldots, n$$

where the $\varepsilon_i$ are uncorrelated, with $E\{\varepsilon_i\} = 0$ and $\sigma^2\{\varepsilon_i\} = \sigma^2$. 

a. State the least squares criterion and derive the least squares estimators of $\beta_1$ and $\beta_2$. 

b. Assuming that the $\varepsilon_i$ are independent normal random variables, state the likelihood function 
and obtain the maximum likelihood estimators of $\beta_1$ and $\beta_2$. Are these the same as the least 
squares estimators? 

6.24. (Calculus needed.) Consider the multiple regression model: 

Yi = + + A^'2 + f = 1， " 

where the $\varepsilon_i$ are independent $N(0, \sigma^2)$. 

a. State the least squares criterion and derive the least squares normal equations. 

b. State the likelihood function and explain why the maximum likelihood estimators will be 
the same as the least squares estimators. 

6.25. An analyst wanted to fit the regression model $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i$, 
$i = 1, \ldots, n$, by the method of least squares when it is known that $\beta_2 = 4$. How can the analyst 
obtain the desired fit by using a multiple regression computer program? 

6.26. For regression model (6.1), show that the coefficient of simple determination between $Y_i$ and 
$\hat{Y}_i$ equals the coefficient of multiple determination $R^2$. 






6.27. In a small-scale regression study, the following data were obtained: 

i:        1     2     3     4     5     6
X_i1:     7     4    16     3    21     8
X_i2:    33    41     7    49     5    31
Y_i:     42    33    75    28    91    55

Assume that regression model (6.1) with independent normal error terms is appropriate. Using 
matrix methods, obtain (a) $\mathbf{b}$; (b) $\mathbf{e}$; (c) $\mathbf{H}$; (d) SSR; (e) $s^2\{\mathbf{b}\}$; (f) $\hat{Y}_h$ when $X_{h1} = 10$, $X_{h2} = 30$; 
(g) $s^2\{\hat{Y}_h\}$ when $X_{h1} = 10$, $X_{h2} = 30$. 


Projects 

6.28. Refer to the CDI data set in Appendix C.2. You have been asked to evaluate two alternative 
models for predicting the number of active physicians (Y) in a CDI. Proposed model I includes 
as predictor variables total population ($X_1$), land area ($X_2$), and total personal income ($X_3$). 
Proposed model II includes as predictor variables population density ($X_1$, total population 
divided by land area), percent of population greater than 64 years old ($X_2$), and total personal 
income ($X_3$). 

a. Prepare a stem-and-leaf plot for each of the predictor variables. What noteworthy information 
is provided by your plots? 

b. Obtain the scatter plot matrix and the correlation matrix for each proposed model. Summarize 
the information provided. 

c. For each proposed model, fit the first-order regression model (6.5) with three predictor 
variables. 

d. Calculate $R^2$ for each model. Is one model clearly preferable in terms of this measure? 

e. For each model, obtain the residuals and plot them against $\hat{Y}$, each of the three predictor 
variables, and each of the two-factor interaction terms. Also prepare a normal probability 
plot for each of the two fitted models. Interpret your plots and state your findings. Is one 
model clearly preferable in terms of appropriateness? 


6.29. Refer to the CDI data set in Appendix C.2. 

a. For each geographic region, regress the number of serious crimes in a CDI (Y) against 
population density ($X_1$, total population divided by land area), per capita personal income 
($X_2$), and percent high school graduates ($X_3$). Use first-order regression model (6.5) with 
three predictor variables. State the estimated regression functions. 

b. Are the estimated regression functions similar for the four regions? Discuss. 

c. Calculate MSE and $R^2$ for each region. Are these measures similar for the four regions? 
Discuss. 

d. Obtain the residuals for each fitted model and prepare a box plot of the residuals for each 
fitted model. Interpret your plots and state your findings. 

6.30. Refer to the SENIC data set in Appendix C.1. Two models have been proposed for predicting the 
average length of patient stay in a hospital (Y). Model I utilizes as predictor variables age ($X_1$), 
infection risk ($X_2$), and available facilities and services ($X_3$). Model II uses as predictor variables 
number of beds ($X_1$), infection risk ($X_2$), and available facilities and services ($X_3$). 

a. Prepare a stem-and-leaf plot for each of the predictor variables. What information do these 
plots provide? 

b. Obtain the scatter plot matrix and the correlation matrix for each proposed model. Interpret 
these and state your principal findings. 



c. For each of the two proposed models, fit first-order regression model (6.5) with three 
predictor variables. 

d. Calculate $R^2$ for each model. Is one model clearly preferable in terms of this measure? 

e. For each model, obtain the residuals and plot them against $\hat{Y}$, each of the three predictor 
variables, and each of the two-factor interaction terms. Also prepare a normal probability 
plot of the residuals for each of the two fitted models. Interpret your plots and state your 
findings. Is one model clearly more appropriate than the other? 

6.31. Refer to the SENIC data set in Appendix C.1. 

a. For each geographic region, regress infection risk (Y) against the predictor variables age 
($X_1$), routine culturing ratio ($X_2$), average daily census ($X_3$), and available facilities and 
services ($X_4$). Use first-order regression model (6.5) with four predictor variables. State the 
estimated regression functions. 

b. Are the estimated regression functions similar for the four regions? Discuss. 

c. Calculate MSE and $R^2$ for each region. Are these measures similar for the four regions? 
Discuss. 

d. Obtain the residuals for each fitted model and prepare a box plot of the residuals for each 
fitted model. Interpret the plots and state your findings. 



Chapter 7 


Multiple Regression II 


In this chapter, we take up some specialized topics that are unique to multiple regression. 
These include extra sums of squares, which are useful for conducting a variety of tests about 
the regression coefficients, the standardized version of the multiple regression model, and 
multicollinearity, a condition where the predictor variables are highly correlated. 

7.1 Extra Sums of Squares 

Basic Ideas 

An extra sum of squares measures the marginal reduction in the error sum of squares 
when one or several predictor variables are added to the regression model, given that other 
predictor variables are already in the model. Equivalently, one can view an extra sum of 
squares as measuring the marginal increase in the regression sum of squares when one or 
several predictor variables are added to the regression model. 

We first utilize an example to illustrate these ideas, and then we present definitions of 
extra sums of squares and discuss a variety of uses of extra sums of squares in tests about 
regression coefficients. 

Table 7.1 contains a portion of the data for a study of the relation of amount of body fat 
(Y) to several possible predictor variables, based on a sample of 20 healthy females 25-34 
years old. The possible predictor variables are triceps skinfold thickness ($X_1$), thigh 
circumference ($X_2$), and midarm circumference ($X_3$). The amount of body fat in Table 7.1 
for each of the 20 persons was obtained by a cumbersome and expensive procedure requiring 
the immersion of the person in water. It would therefore be very helpful if a regression 
model with some or all of these predictor variables could provide reliable estimates of the 
amount of body fat since the measurements needed for the predictor variables are easy to 
obtain. 

Table 7.2 contains some of the main regression results when body fat (Y) is regressed 
(1) on triceps skinfold thickness ($X_1$) alone, (2) on thigh circumference ($X_2$) alone, (3) on 
$X_1$ and $X_2$ only, and (4) on all three predictor variables. To keep track of the regression 
model that is fitted, we shall modify our notation slightly. The regression sum of squares 
when $X_1$ only is in the model is, according to Table 7.2a, 352.27. This sum of squares 
will be denoted by $SSR(X_1)$. The error sum of squares for this model will be denoted by 
$SSE(X_1)$; according to Table 7.2a, it is $SSE(X_1) = 143.12$. 


Example 





Chapter 7 Multiple Regression II 257 


Similarly, Table 7.2c indicates that when $X_1$ and $X_2$ are in the regression model, 
the regression sum of squares is $SSR(X_1, X_2) = 385.44$ and the error sum of squares is 
$SSE(X_1, X_2) = 109.95$. 

Notice that the error sum of squares when $X_1$ and $X_2$ are in the model, $SSE(X_1, X_2) = 109.95$, 
is smaller than when the model contains only $X_1$, $SSE(X_1) = 143.12$. The difference 
is called an extra sum of squares and will be denoted by $SSR(X_2|X_1)$: 

$$SSR(X_2|X_1) = SSE(X_1) - SSE(X_1, X_2) = 143.12 - 109.95 = 33.17$$


TABLE 7.1 Basic Data—Body Fat Example. 

Subject   Triceps Skinfold   Thigh           Midarm          Body Fat
   i      Thickness X_i1     Circumference   Circumference     Y_i
                             X_i2            X_i3
   1          19.5             43.1            29.1            11.9
   2          24.7             49.8            28.2            22.8
   3          30.7             51.9            37.0            18.7
  ...          ...              ...             ...             ...
  18          30.2             58.6            24.6            25.4
  19          22.7             48.2            27.1            14.8
  20          25.2             51.0            27.5            21.1


TABLE 7.2 Regression Results for Several Fitted Models—Body Fat Example. 

(a) Regression of Y on X_1 
$\hat{Y} = -1.496 + .8572 X_1$

Source of Variation      SS       df      MS
Regression             352.27      1    352.27
Error                  143.12     18      7.95
Total                  495.39     19

Variable   Estimated Regression Coefficient   Estimated Standard Deviation     t*
X_1        b_1 = .8572                        s{b_1} = .1288                  6.66

(b) Regression of Y on X_2 
$\hat{Y} = -23.634 + .8565 X_2$

Source of Variation      SS       df      MS
Regression             381.97      1    381.97
Error                  113.42     18      6.30
Total                  495.39     19

Variable   Estimated Regression Coefficient   Estimated Standard Deviation     t*
X_2        b_2 = .8565                        s{b_2} = .1100                  7.79

(c) Regression of Y on X_1 and X_2 
$\hat{Y} = -19.174 + .2224 X_1 + .6594 X_2$

Source of Variation      SS       df      MS
Regression             385.44      2    192.72
Error                  109.95     17      6.47
Total                  495.39     19

Variable   Estimated Regression Coefficient   Estimated Standard Deviation     t*
X_1        b_1 = .2224                        s{b_1} = .3034                   .73
X_2        b_2 = .6594                        s{b_2} = .2912                  2.26

(d) Regression of Y on X_1, X_2, and X_3 
$\hat{Y} = 117.08 + 4.334 X_1 - 2.857 X_2 - 2.186 X_3$

Source of Variation      SS       df      MS
Regression             396.98      3    132.33
Error                   98.41     16      6.15
Total                  495.39     19

Variable   Estimated Regression Coefficient   Estimated Standard Deviation     t*
X_1        b_1 = 4.334                        s{b_1} = 3.016                  1.44
X_2        b_2 = -2.857                       s{b_2} = 2.582                 -1.11
X_3        b_3 = -2.186                       s{b_3} = 1.596                 -1.37


This reduction in the error sum of squares is the result of adding $X_2$ to the regression model 
when $X_1$ is already included in the model. Thus, the extra sum of squares $SSR(X_2|X_1)$ 
measures the marginal effect of adding $X_2$ to the regression model when $X_1$ is already in 
the model. The notation $SSR(X_2|X_1)$ reflects this additional or extra reduction in the error 
sum of squares associated with $X_2$, given that $X_1$ is already included in the model. 

The extra sum of squares $SSR(X_2|X_1)$ equivalently can be viewed as the marginal increase 
in the regression sum of squares: 

$$SSR(X_2|X_1) = SSR(X_1, X_2) - SSR(X_1) = 385.44 - 352.27 = 33.17$$

The reason for the equivalence of the marginal reduction in the error sum of squares and 
the marginal increase in the regression sum of squares is the basic analysis of variance 
identity (2.50): 

$$SSTO = SSR + SSE$$

Since SSTO measures the variability of the $Y_i$ observations and hence does not depend on 
the regression model fitted, any reduction in SSE implies an identical increase in SSR. 
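The equivalence can also be checked numerically. The sketch below fits two nested models by least squares in plain Python and computes the extra sum of squares both ways. The data are hypothetical (they are not the body fat data), and the helpers `solve` and `fit_sums_of_squares` are illustrative names written for this example only.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_sums_of_squares(columns, y):
    """Return (SSR, SSE) for the least squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X'X) b = X'Y
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    b = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in design]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    return ssr, sse


# Hypothetical data, chosen only so that X1 and X2 are not collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

ssr_1, sse_1 = fit_sums_of_squares([x1], y)
ssr_12, sse_12 = fit_sums_of_squares([x1, x2], y)

extra_from_sse = sse_1 - sse_12    # SSE(X1) - SSE(X1, X2)
extra_from_ssr = ssr_12 - ssr_1    # SSR(X1, X2) - SSR(X1)
print(extra_from_sse, extra_from_ssr)
```

The two printed values agree up to floating-point rounding because both fits share the same SSTO.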





Definitions 


We can consider other extra sums of squares, such as the marginal effect of adding $X_3$ to 
the regression model when $X_1$ and $X_2$ are already in the model. We find from Tables 7.2c 
and 7.2d that: 

$$SSR(X_3|X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3) = 109.95 - 98.41 = 11.54$$

or, equivalently: 

$$SSR(X_3|X_1, X_2) = SSR(X_1, X_2, X_3) - SSR(X_1, X_2) = 396.98 - 385.44 = 11.54$$

We can even consider the marginal effect of adding several variables, such as adding 
both $X_2$ and $X_3$ to the regression model already containing $X_1$ (see Tables 7.2a and 7.2d): 

$$SSR(X_2, X_3|X_1) = SSE(X_1) - SSE(X_1, X_2, X_3) = 143.12 - 98.41 = 44.71$$

or, equivalently: 

$$SSR(X_2, X_3|X_1) = SSR(X_1, X_2, X_3) - SSR(X_1) = 396.98 - 352.27 = 44.71$$


We assemble now our earlier definitions of extra sums of squares and provide some additional 
ones. As we noted earlier, an extra sum of squares always involves the difference 
between the error sum of squares for the regression model containing the X variable(s) 
already in the model and the error sum of squares for the regression model containing both 
the original X variable(s) and the new X variable(s). Equivalently, an extra sum of squares 
involves the difference between the two corresponding regression sums of squares. 

Thus, we define: 

$$SSR(X_1|X_2) = SSE(X_2) - SSE(X_1, X_2) \tag{7.1a}$$

or, equivalently: 

$$SSR(X_1|X_2) = SSR(X_1, X_2) - SSR(X_2) \tag{7.1b}$$

If $X_2$ is the extra variable, we define: 

$$SSR(X_2|X_1) = SSE(X_1) - SSE(X_1, X_2) \tag{7.2a}$$

or, equivalently: 

$$SSR(X_2|X_1) = SSR(X_1, X_2) - SSR(X_1) \tag{7.2b}$$

Extensions for three or more variables are straightforward. For example, we define: 

$$SSR(X_3|X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3) \tag{7.3a}$$

or: 

$$SSR(X_3|X_1, X_2) = SSR(X_1, X_2, X_3) - SSR(X_1, X_2) \tag{7.3b}$$

and: 

$$SSR(X_2, X_3|X_1) = SSE(X_1) - SSE(X_1, X_2, X_3) \tag{7.4a}$$

or: 

$$SSR(X_2, X_3|X_1) = SSR(X_1, X_2, X_3) - SSR(X_1) \tag{7.4b}$$

Decomposition of SSR into Extra Sums of Squares 

In multiple regression, unlike simple linear regression, we can obtain a variety of 
decompositions of the regression sum of squares SSR into extra sums of squares. Let us consider 
the case of two X variables. We begin with the identity (2.50) for variable $X_1$: 

$$SSTO = SSR(X_1) + SSE(X_1) \tag{7.5}$$

where the notation now shows explicitly that $X_1$ is the X variable in the model. Replacing 
$SSE(X_1)$ by its equivalent in (7.2a), we obtain: 

$$SSTO = SSR(X_1) + SSR(X_2|X_1) + SSE(X_1, X_2) \tag{7.6}$$

We now make use of the same identity for multiple regression with two X variables as 
in (7.5) for a single X variable, namely: 

$$SSTO = SSR(X_1, X_2) + SSE(X_1, X_2) \tag{7.7}$$

Solving (7.7) for $SSE(X_1, X_2)$ and using this expression in (7.6) lead to: 

$$SSR(X_1, X_2) = SSR(X_1) + SSR(X_2|X_1) \tag{7.8}$$

Thus, we have decomposed the regression sum of squares $SSR(X_1, X_2)$ into two marginal 
components: (1) $SSR(X_1)$, measuring the contribution by including $X_1$ alone in the model, 
and (2) $SSR(X_2|X_1)$, measuring the additional contribution when $X_2$ is included, given that 
$X_1$ is already in the model. 

Of course, the order of the X variables is arbitrary. Here, we can also obtain the 
decomposition: 

$$SSR(X_1, X_2) = SSR(X_2) + SSR(X_1|X_2) \tag{7.9}$$

We show in Figure 7.1 schematic representations of the two decompositions of 
$SSR(X_1, X_2)$ for the body fat example. The total bar on the left represents SSTO and 
presents decomposition (7.9). The unshaded component of this bar is $SSR(X_2)$, and the 
combined shaded area represents $SSE(X_2)$. The latter area in turn is the combination of the 
extra sum of squares $SSR(X_1|X_2)$ and the error sum of squares $SSE(X_1, X_2)$ when both 
$X_1$ and $X_2$ are included in the model. Similarly, the bar on the right in Figure 7.1 shows 
decomposition (7.8). Note in both cases how the extra sum of squares can be viewed either 
as a reduction in the error sum of squares or as an increase in the regression sum of squares 
when the second predictor variable is added to the regression model. 

When the regression model contains three X variables, a variety of decompositions of 
$SSR(X_1, X_2, X_3)$ can be obtained. We illustrate three of these: 

$$SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2|X_1) + SSR(X_3|X_1, X_2) \tag{7.10a}$$
$$SSR(X_1, X_2, X_3) = SSR(X_2) + SSR(X_3|X_2) + SSR(X_1|X_2, X_3) \tag{7.10b}$$
$$SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2, X_3|X_1) \tag{7.10c}$$



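FIGURE 7.1 Schematic Representation of Extra Sums of Squares—Body Fat Example. 

[Figure: two bars, each of total height SSTO = 495.39. Left bar, decomposition (7.9): SSR(X_2) = 381.97, SSR(X_1|X_2) = 3.47, and SSE(X_1, X_2) = 109.95, with SSE(X_2) = 113.42 and SSR(X_1, X_2) = 385.44. Right bar, decomposition (7.8): SSR(X_1) = 352.27, SSR(X_2|X_1) = 33.17, and SSE(X_1, X_2) = 109.95, with SSE(X_1) = 143.12 and SSR(X_1, X_2) = 385.44.] 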

TABLE 7.3 Example of ANOVA Table with Decomposition of SSR for Three X Variables. 

Source of Variation      SS                      df       MS
Regression               SSR(X_1, X_2, X_3)      3        MSR(X_1, X_2, X_3)
  X_1                    SSR(X_1)                1        MSR(X_1)
  X_2|X_1                SSR(X_2|X_1)            1        MSR(X_2|X_1)
  X_3|X_1, X_2           SSR(X_3|X_1, X_2)       1        MSR(X_3|X_1, X_2)
Error                    SSE(X_1, X_2, X_3)      n - 4    MSE(X_1, X_2, X_3)
Total                    SSTO                    n - 1


It is obvious that the number of possible decompositions becomes vast as the number of 
X variables in the regression model increases. 


ANOVA Table Containing Decomposition of SSR 

ANOVA tables can be constructed containing decompositions of the regression sum of 
squares into extra sums of squares. Table 7.3 contains the ANOVA table decomposition 
for the case of three X variables often used in regression packages, and Table 7.4 contains 
this same decomposition for the body fat example. The decomposition involves single extra 
X variables.


Note that each extra sum of squares involving a single extra X variable has associated
with it one degree of freedom. The resulting mean squares are constructed as usual. For
example, $MSR(X_2 \mid X_1)$ in Table 7.3 is obtained as follows:

$$MSR(X_2 \mid X_1) = \frac{SSR(X_2 \mid X_1)}{1}$$


TABLE 7.4 ANOVA Table with Decomposition of SSR: Body Fat Example with Three Predictor Variables.

Source of Variation    SS        df    MS
Regression             396.98    3     132.33
  X1                   352.27    1     352.27
  X2 | X1              33.17     1     33.17
  X3 | X1, X2          11.54     1     11.54
Error                  98.41     16    6.15
Total                  495.39    19

Extra sums of squares involving two extra X variables, such as $SSR(X_2, X_3 \mid X_1)$, have
two degrees of freedom associated with them. This follows because we can express such
an extra sum of squares as a sum of two extra sums of squares, each associated with one
degree of freedom. For example, by definition of the extra sums of squares, we have:

$$SSR(X_2, X_3 \mid X_1) = SSR(X_2 \mid X_1) + SSR(X_3 \mid X_1, X_2) \tag{7.11}$$

The mean square $MSR(X_2, X_3 \mid X_1)$ is therefore obtained as follows:

$$MSR(X_2, X_3 \mid X_1) = \frac{SSR(X_2, X_3 \mid X_1)}{2}$$


Many computer regression packages provide decompositions of SSR into single-degree-of-freedom
extra sums of squares, usually in the order in which the X variables are entered
into the model. Thus, if the X variables are entered in the order $X_1$, $X_2$, $X_3$, the extra sums
of squares given in the output are:

$$SSR(X_1)$$
$$SSR(X_2 \mid X_1)$$
$$SSR(X_3 \mid X_1, X_2)$$

If an extra sum of squares involving several extra X variables is desired, it can be obtained
by summing appropriate single-degree-of-freedom extra sums of squares. For instance, to
obtain $SSR(X_2, X_3 \mid X_1)$ in our earlier illustration, we would utilize (7.11) and simply add
$SSR(X_2 \mid X_1)$ and $SSR(X_3 \mid X_1, X_2)$.
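As a small illustration of this summation, the Python sketch below pools the sequential extra sums of squares of Table 7.4. The helper name `pooled_extra_ss` is hypothetical and is not taken from any regression package.

```python
# Sequential (single-degree-of-freedom) extra sums of squares from Table 7.4,
# listed in the entry order X1, X2, X3.
sequential_ss = {"X1": 352.27, "X2|X1": 33.17, "X3|X1,X2": 11.54}


def pooled_extra_ss(seq, first_added):
    """Sum the sequential terms from position `first_added` (0-based) onward.

    Returns the pooled extra sum of squares and its degrees of freedom.
    """
    values = list(seq.values())[first_added:]
    return sum(values), len(values)


# SSR(X2, X3 | X1) = SSR(X2 | X1) + SSR(X3 | X1, X2), as in (7.11).
ssr_x2_x3_given_x1, df = pooled_extra_ss(sequential_ss, 1)
msr_x2_x3_given_x1 = ssr_x2_x3_given_x1 / df

# Pooling all three terms recovers the regression sum of squares.
ssr_total, df_total = pooled_extra_ss(sequential_ss, 0)
```

The pooled term only has the stated meaning when the terms being summed are the last ones entered, which is why the entry order matters in the discussion that follows.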

If the extra sum of squares $SSR(X_1, X_3 \mid X_2)$ were desired with a computer package
that provides single-degree-of-freedom extra sums of squares in the order in which the X
variables are entered, the X variables would need to be entered in the order $X_2$, $X_1$, $X_3$ or
$X_2$, $X_3$, $X_1$. The first ordering would give:

$$SSR(X_2)$$
$$SSR(X_1 \mid X_2)$$
$$SSR(X_3 \mid X_1, X_2)$$

The sum of the last two extra sums of squares will yield $SSR(X_1, X_3 \mid X_2)$.

The reason why extra sums of squares are of interest is that they occur in a variety 
of tests about regression coefficients where the question of concern is whether certain X 
variables can be dropped from the regression model. We turn next to this use of extra sums of 
squares. 





7.2 Uses of Extra Sums of Squares in Tests for Regression Coefficients


Test whether a Single $\beta_k = 0$

When we wish to test whether the term $\beta_k X_k$ can be dropped from a multiple regression
model, we are interested in the alternatives:

$$H_0: \beta_k = 0$$
$$H_a: \beta_k \neq 0$$

We already know that test statistic (6.51b):

$$t^* = \frac{b_k}{s\{b_k\}}$$

is appropriate for this test.

Equivalently, we can use the general linear test approach described in Section 2.8. We
now show that this approach involves an extra sum of squares. Let us consider the first-order
regression model with three predictor variables:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \qquad \text{Full model} \tag{7.12}$$

To test the alternatives:

$$H_0: \beta_3 = 0 \qquad H_a: \beta_3 \neq 0 \tag{7.13}$$

we fit the full model and obtain the error sum of squares SSE(F). We now explicitly show
the variables in the full model, as follows:

$$SSE(F) = SSE(X_1, X_2, X_3)$$

The degrees of freedom associated with SSE(F) are $df_F = n - 4$ since there are four
parameters in the regression function for the full model (7.12).

The reduced model when $H_0$ in (7.13) holds is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \qquad \text{Reduced model} \tag{7.14}$$

We next fit this reduced model and obtain:

$$SSE(R) = SSE(X_1, X_2)$$

There are $df_R = n - 3$ degrees of freedom associated with the reduced model.

The general linear test statistic (2.70):

$$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F}$$

here becomes:

$$F^* = \frac{SSE(X_1, X_2) - SSE(X_1, X_2, X_3)}{(n - 3) - (n - 4)} \div \frac{SSE(X_1, X_2, X_3)}{n - 4}$$



Note that the difference between the two error sums of squares in the numerator term is the
extra sum of squares (7.3a):

$$SSE(X_1, X_2) - SSE(X_1, X_2, X_3) = SSR(X_3 \mid X_1, X_2)$$

Hence the general linear test statistic here is:

$$F^* = \frac{SSR(X_3 \mid X_1, X_2)}{1} \div \frac{SSE(X_1, X_2, X_3)}{n - 4} = \frac{MSR(X_3 \mid X_1, X_2)}{MSE(X_1, X_2, X_3)} \tag{7.15}$$

We thus see that the test whether or not $\beta_3 = 0$ is a marginal test, given that $X_1$ and $X_2$
are already in the model. We also note that the extra sum of squares $SSR(X_3 \mid X_1, X_2)$ has
one degree of freedom associated with it, just as we noted earlier.

Test statistic (7.15) shows that we do not need to fit both the full model and the reduced
model to use the general linear test approach here. A single computer run can provide a fit
of the full model and the appropriate extra sum of squares.


Example


In the body fat example, we wish to test for the model with all three predictor variables
whether midarm circumference ($X_3$) can be dropped from the model. The test alternatives
are those of (7.13). Table 7.4 contains the ANOVA results from a computer fit of the full
regression model (7.12), including the extra sums of squares when the predictor variables
are entered in the order $X_1$, $X_2$, $X_3$. Hence, test statistic (7.15) here is:

$$F^* = \frac{SSR(X_3 \mid X_1, X_2)}{1} \div \frac{SSE(X_1, X_2, X_3)}{n - 4} = \frac{11.54}{1} \div \frac{98.41}{16} = 1.88$$

For $\alpha = .01$, we require $F(.99; 1, 16) = 8.53$. Since $F^* = 1.88 \le 8.53$, we conclude $H_0$,
that $X_3$ can be dropped from the regression model that already contains $X_1$ and $X_2$.
Note from Table 7.2d that the $t^*$ test statistic here is:

$$t^* = \frac{b_3}{s\{b_3\}} = \frac{-2.186}{1.596} = -1.37$$

Since $(t^*)^2 = (-1.37)^2 = 1.88 = F^*$, we see that the two test statistics are equivalent, just
as for simple linear regression.
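The arithmetic of this example can be reproduced with a short Python sketch. The function `partial_f` is our own illustrative helper, not part of any statistics package; the critical value 8.53 is taken from the text rather than computed.

```python
def partial_f(extra_ss, df_extra, sse_full, df_full):
    """General linear test statistic: (extra SS / its df) / (SSE(F) / df_F)."""
    return (extra_ss / df_extra) / (sse_full / df_full)


# Table 7.4: SSR(X3 | X1, X2) = 11.54, SSE(X1, X2, X3) = 98.41, n - 4 = 16.
f_star = partial_f(11.54, 1, 98.41, 16)

# Equivalent t* statistic, using b3 and s{b3} quoted from Table 7.2d.
t_star = -2.186 / 1.596

F_CRITICAL = 8.53  # F(.99; 1, 16), as given in the text
conclude_h0 = f_star <= F_CRITICAL

print(round(f_star, 2), round(t_star, 2), conclude_h0)
```

Squaring `t_star` reproduces `f_star` up to the rounding of the tabled inputs, which is the equivalence noted above.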


Comment 

The $F^*$ test statistic (7.15) to test whether or not $\beta_3 = 0$ is called a partial F test statistic to distinguish
it from the $F^*$ statistic in (6.39b) for testing whether all $\beta_k = 0$, i.e., whether or not there is a regression
relation between Y and the set of X variables. The latter test is called the overall F test.


Test whether Several $\beta_k = 0$

In multiple regression we are frequently interested in whether several terms in the regression
model can be dropped. For example, we may wish to know whether both $\beta_2 X_2$ and $\beta_3 X_3$
can be dropped from the full model (7.12). The alternatives here are:

$$H_0: \beta_2 = \beta_3 = 0 \qquad H_a: \text{not both } \beta_2 \text{ and } \beta_3 \text{ equal zero} \tag{7.16}$$



With the general linear test approach, the reduced model under $H_0$ is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i \qquad \text{Reduced model} \tag{7.17}$$

and the error sum of squares for the reduced model is:

$$SSE(R) = SSE(X_1)$$

This error sum of squares has $df_R = n - 2$ degrees of freedom associated with it.
The general linear test statistic (2.70) thus becomes here:

$$F^* = \frac{SSE(X_1) - SSE(X_1, X_2, X_3)}{(n - 2) - (n - 4)} \div \frac{SSE(X_1, X_2, X_3)}{n - 4}$$

Again the difference between the two error sums of squares in the numerator term is an
extra sum of squares, namely:

$$SSE(X_1) - SSE(X_1, X_2, X_3) = SSR(X_2, X_3 \mid X_1)$$

Hence, the test statistic becomes:

$$F^* = \frac{SSR(X_2, X_3 \mid X_1)}{2} \div \frac{SSE(X_1, X_2, X_3)}{n - 4} = \frac{MSR(X_2, X_3 \mid X_1)}{MSE(X_1, X_2, X_3)} \tag{7.18}$$

Note that $SSR(X_2, X_3 \mid X_1)$ has two degrees of freedom associated with it, as we pointed out
earlier.


Example


We wish to test in the body fat example for the model with all three predictor variables
whether both thigh circumference ($X_2$) and midarm circumference ($X_3$) can be dropped
from the full regression model (7.12). The alternatives are those in (7.16). The appropriate
extra sum of squares can be obtained from Table 7.4, using (7.11):

$$SSR(X_2, X_3 \mid X_1) = SSR(X_2 \mid X_1) + SSR(X_3 \mid X_1, X_2) = 33.17 + 11.54 = 44.71$$

Test statistic (7.18) therefore is:

$$F^* = \frac{SSR(X_2, X_3 \mid X_1)}{2} \div MSE(X_1, X_2, X_3) = \frac{44.71}{2} \div 6.15 = 3.63$$

For $\alpha = .05$, we require $F(.95; 2, 16) = 3.63$. Since $F^* = 3.63$ is at the boundary of the
decision rule (the P-value of the test statistic is .05), we may wish to make further analyses
before deciding whether $X_2$ and $X_3$ should be dropped from the regression model that
already contains $X_1$.


Comments 

1. For testing whether a single $\beta_k$ equals zero, two equivalent test statistics are available: the $t^*$
test statistic and the $F^*$ general linear test statistic. When testing whether several $\beta_k$ equal zero, only
the general linear test statistic $F^*$ is available.

2. General linear test statistic (2.70) for testing whether several X variables can be dropped
from the general linear regression model (6.7) can be expressed in terms of the coefficients of
multiple determination for the full and reduced models. Denoting these by $R_F^2$ and $R_R^2$, respectively,
we have:

$$F^* = \frac{R_F^2 - R_R^2}{df_R - df_F} \div \frac{1 - R_F^2}{df_F} \tag{7.19}$$

Specifically for testing the alternatives in (7.16) for the body fat example, test statistic (7.19) becomes:

$$F^* = \frac{R_{Y|123}^2 - R_{Y|1}^2}{(n - 2) - (n - 4)} \div \frac{1 - R_{Y|123}^2}{n - 4} \tag{7.20}$$

where $R_{Y|123}^2$ denotes the coefficient of multiple determination when Y is regressed on $X_1$, $X_2$, and
$X_3$, and $R_{Y|1}^2$ denotes the coefficient when Y is regressed on $X_1$ alone.

We see from Table 7.4 that $R_{Y|123}^2 = 396.98/495.39 = .80135$ and $R_{Y|1}^2 = 352.27/495.39 = .71110$.
Hence, we obtain by substituting in (7.20):

$$F^* = \frac{.80135 - .71110}{(20 - 2) - (20 - 4)} \div \frac{1 - .80135}{16} = 3.63$$

This is the same result as before. Note that $R_{Y|1}^2$ corresponds to the coefficient of simple determination
$R^2$ between Y and $X_1$.

Test statistic (7.19) is not appropriate when the full and reduced regression models do not contain
the intercept term $\beta_0$. In that case, the general linear test statistic in the form (2.70) must be used.
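A brief Python sketch can confirm that form (7.19) and the sums-of-squares form (7.18) agree for the body fat example. The function name `f_from_r2` is a hypothetical helper of ours; the inputs are the values in Table 7.4.

```python
def f_from_r2(r2_full, r2_reduced, df_reduced, df_full):
    """Test statistic (7.19); appropriate only when both models contain an intercept."""
    return ((r2_full - r2_reduced) / (df_reduced - df_full)) / ((1 - r2_full) / df_full)


SSTO, n = 495.39, 20
r2_full = 396.98 / SSTO      # Y regressed on X1, X2, X3
r2_reduced = 352.27 / SSTO   # Y regressed on X1 alone

f_r2 = f_from_r2(r2_full, r2_reduced, df_reduced=n - 2, df_full=n - 4)

# The same statistic computed from sums of squares, as in (7.18).
f_ss = ((396.98 - 352.27) / 2) / (98.41 / 16)

print(round(f_r2, 2), round(f_ss, 2))
```

The two forms coincide because dividing every sum of squares by SSTO leaves the ratio unchanged.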


7.3 Summary of Tests Concerning Regression Coefficients

We have already discussed how to conduct several types of tests concerning regression 
coefficients in a multiple regression model. For completeness, we summarize here these 
tests as well as some additional types of tests. 


Test whether All $\beta_k = 0$

This is the overall F test (6.39) of whether or not there is a regression relation between the
response variable Y and the set of X variables. The alternatives are:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \qquad H_a: \text{not all } \beta_k\ (k = 1, \ldots, p - 1) \text{ equal zero} \tag{7.21}$$

and the test statistic is:

$$F^* = \frac{SSR(X_1, \ldots, X_{p-1})}{p - 1} \div \frac{SSE(X_1, \ldots, X_{p-1})}{n - p} = \frac{MSR}{MSE} \tag{7.22}$$

If $H_0$ holds, $F^* \sim F(p - 1, n - p)$. Large values of $F^*$ lead to conclusion $H_a$.




Test whether a Single $\beta_k = 0$

This is a partial F test of whether a particular regression coefficient $\beta_k$ equals zero. The
alternatives are:

$$H_0: \beta_k = 0 \qquad H_a: \beta_k \neq 0 \tag{7.23}$$

and the test statistic is:

$$F^* = \frac{SSR(X_k \mid X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_{p-1})}{1} \div \frac{SSE(X_1, \ldots, X_{p-1})}{n - p} = \frac{MSR(X_k \mid X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_{p-1})}{MSE} \tag{7.24}$$

If $H_0$ holds, $F^* \sim F(1, n - p)$. Large values of $F^*$ lead to conclusion $H_a$. Statistics
packages that provide extra sums of squares permit use of this test without having to fit the
reduced model.

An equivalent test statistic is (6.51b):

$$t^* = \frac{b_k}{s\{b_k\}} \tag{7.25}$$

If $H_0$ holds, $t^* \sim t(n - p)$. Large values of $|t^*|$ lead to conclusion $H_a$.

Since the two tests are equivalent, the choice is usually made in terms of available
information provided by the regression package output.


Test whether Some $\beta_k = 0$

This is another partial F test. Here, the alternatives are:

$$H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p-1} = 0 \qquad H_a: \text{not all of the } \beta_k \text{ in } H_0 \text{ equal zero} \tag{7.26}$$

where for convenience, we arrange the model so that the last $p - q$ coefficients are the ones
to be tested. The test statistic is:

$$F^* = \frac{SSR(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1})}{p - q} \div \frac{SSE(X_1, \ldots, X_{p-1})}{n - p} = \frac{MSR(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1})}{MSE} \tag{7.27}$$

If $H_0$ holds, $F^* \sim F(p - q, n - p)$. Large values of $F^*$ lead to conclusion $H_a$.

Note that test statistic (7.27) actually encompasses the two earlier cases. If $q = 1$, the
test is whether all regression coefficients equal zero. If $q = p - 1$, the test is whether a
single regression coefficient equals zero. Also note that test statistic (7.27) can be calculated
without having to fit the reduced model if the regression package provides the needed extra
sums of squares:

$$SSR(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1}) = SSR(X_q \mid X_1, \ldots, X_{q-1}) + \cdots + SSR(X_{p-1} \mid X_1, \ldots, X_{p-2}) \tag{7.28}$$





Test statistic (7.27) can be stated equivalently in terms of the coefficients of multiple
determination for the full and reduced models when these models contain the intercept term
$\beta_0$, as follows:

$$F^* = \frac{R_{Y|1 \cdots p-1}^2 - R_{Y|1 \cdots q-1}^2}{p - q} \div \frac{1 - R_{Y|1 \cdots p-1}^2}{n - p} \tag{7.29}$$

where $R_{Y|1 \cdots p-1}^2$ denotes the coefficient of multiple determination when Y is regressed on
all X variables, and $R_{Y|1 \cdots q-1}^2$ denotes the coefficient when Y is regressed on $X_1, \ldots, X_{q-1}$
only.


Other Tests 

When tests about regression coefficients are desired that do not involve testing whether one
or several $\beta_k$ equal zero, extra sums of squares cannot be used and the general linear test
approach requires separate fittings of the full and reduced models. For instance, for the full
model containing three X variables:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \qquad \text{Full model} \tag{7.30}$$

we might wish to test:

$$H_0: \beta_1 = \beta_2 \qquad H_a: \beta_1 \neq \beta_2 \tag{7.31}$$

The procedure would be to fit the full model (7.30), and then the reduced model:

$$Y_i = \beta_0 + \beta_c (X_{i1} + X_{i2}) + \beta_3 X_{i3} + \varepsilon_i \qquad \text{Reduced model} \tag{7.32}$$

where $\beta_c$ denotes the common coefficient for $\beta_1$ and $\beta_2$ under $H_0$ and $X_{i1} + X_{i2}$ is the
corresponding new X variable. We then use the general $F^*$ test statistic (2.70) with 1 and
$n - 4$ degrees of freedom.

Another example where extra sums of squares cannot be used is in the following test for
regression model (7.30):

$$H_0: \beta_1 = 3, \ \beta_3 = 5 \qquad H_a: \text{not both equalities in } H_0 \text{ hold} \tag{7.33}$$

Here, the reduced model would be:

$$Y_i - 3X_{i1} - 5X_{i3} = \beta_0 + \beta_2 X_{i2} + \varepsilon_i \qquad \text{Reduced model} \tag{7.34}$$

Note the new response variable $Y - 3X_1 - 5X_3$ in the reduced model, since $\beta_1 X_1$ and $\beta_3 X_3$
are known constants under $H_0$. We then use the general linear test statistic $F^*$ in (2.70) with
2 and $n - 4$ degrees of freedom.
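The separate fittings needed for a test such as (7.31) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data are made up, the helpers `solve` and `sse` are our own, and the fit solves the normal equations directly, which is adequate for a tiny example but is not how production regression software proceeds.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sse(columns, y):
    """Error sum of squares when y is regressed on an intercept plus `columns`."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * x for c, x in zip(coef, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


# Made-up data, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 2.0]
y = [6.1, 4.8, 11.2, 9.9, 18.3, 15.2, 19.8, 24.1]
n = len(y)

sse_full = sse([x1, x2, x3], y)             # full model (7.30)
x_sum = [u + v for u, v in zip(x1, x2)]     # new variable X1 + X2
sse_reduced = sse([x_sum, x3], y)           # reduced model (7.32)

# General linear test statistic (2.70) with 1 and n - 4 degrees of freedom.
f_star = ((sse_reduced - sse_full) / 1) / (sse_full / (n - 4))
```

Because the reduced model is a constrained version of the full model, `sse_reduced` can never fall below `sse_full`, so `f_star` is nonnegative.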


7.4 Coefficients of Partial Determination

Extra sums of squares are not only useful for tests on the regression coefficients of a multiple 
regression model, but they are also encountered in descriptive measures of relationship called 
coefficients of partial determination. Recall that the coefficient of multiple determination, 
$R^2$, measures the proportionate reduction in the variation of Y achieved by the introduction
of the entire set of X variables considered in the model. A coefficient of partial determination,
in contrast, measures the marginal contribution of one X variable when all others are already 
included in the model. 


Two Predictor Variables

We first consider a first-order multiple regression model with two predictor variables, as
given in (6.1):

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$

$SSE(X_2)$ measures the variation in Y when $X_2$ is included in the model. $SSE(X_1, X_2)$
measures the variation in Y when both $X_1$ and $X_2$ are included in the model. Hence, the
relative marginal reduction in the variation in Y associated with $X_1$ when $X_2$ is already in
the model is:

$$\frac{SSE(X_2) - SSE(X_1, X_2)}{SSE(X_2)} = \frac{SSR(X_1 \mid X_2)}{SSE(X_2)}$$

This measure is the coefficient of partial determination between Y and $X_1$, given that $X_2$ is
in the model. We denote this measure by $R_{Y1|2}^2$:

$$R_{Y1|2}^2 = \frac{SSE(X_2) - SSE(X_1, X_2)}{SSE(X_2)} = \frac{SSR(X_1 \mid X_2)}{SSE(X_2)} \tag{7.35}$$


Thus, $R_{Y1|2}^2$ measures the proportionate reduction in the variation in Y remaining after $X_2$
is included in the model that is gained by also including $X_1$ in the model.

The coefficient of partial determination between Y and $X_2$, given that $X_1$ is in the model,
is defined correspondingly:

$$R_{Y2|1}^2 = \frac{SSR(X_2 \mid X_1)}{SSE(X_1)} \tag{7.36}$$

General Case 

The generalization of coefficients of partial determination to three or more X variables in
the model is immediate. For instance:

$$R_{Y1|23}^2 = \frac{SSR(X_1 \mid X_2, X_3)}{SSE(X_2, X_3)} \tag{7.37}$$

$$R_{Y2|13}^2 = \frac{SSR(X_2 \mid X_1, X_3)}{SSE(X_1, X_3)} \tag{7.38}$$

$$R_{Y3|12}^2 = \frac{SSR(X_3 \mid X_1, X_2)}{SSE(X_1, X_2)} \tag{7.39}$$

$$R_{Y4|123}^2 = \frac{SSR(X_4 \mid X_1, X_2, X_3)}{SSE(X_1, X_2, X_3)} \tag{7.40}$$

Note that in the subscripts to $R^2$, the entries to the left of the vertical bar show in turn
the variable taken as the response and the X variable being added. The entries to the right
of the vertical bar show the X variables already in the model.





Example 


For the body fat example, we can obtain a variety of coefficients of partial determination.
Here are three (Tables 7.2 and 7.4):

$$R_{Y2|1}^2 = \frac{SSR(X_2 \mid X_1)}{SSE(X_1)} = \frac{33.17}{143.12} = .232$$

$$R_{Y3|12}^2 = \frac{SSR(X_3 \mid X_1, X_2)}{SSE(X_1, X_2)} = \frac{11.54}{109.95} = .105$$

$$R_{Y1|2}^2 = \frac{SSR(X_1 \mid X_2)}{SSE(X_2)} = \frac{3.47}{113.42} = .031$$

We see that when $X_2$ is added to the regression model containing $X_1$ here, the error sum
of squares $SSE(X_1)$ is reduced by 23.2 percent. The error sum of squares for the model
containing both $X_1$ and $X_2$ is only reduced by another 10.5 percent when $X_3$ is added to the
model. Finally, if the regression model already contains $X_2$, adding $X_1$ reduces $SSE(X_2)$
by only 3.1 percent.
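These three ratios are simple enough to verify directly. In the Python sketch below, `partial_r2` is an illustrative helper of ours, and the inputs are the sums of squares quoted above.

```python
def partial_r2(extra_ss, sse_before):
    """Coefficient of partial determination: extra SS over the SSE before the variable is added."""
    return extra_ss / sse_before


r2_y2_1 = partial_r2(33.17, 143.12)    # SSR(X2 | X1) / SSE(X1)
r2_y3_12 = partial_r2(11.54, 109.95)   # SSR(X3 | X1, X2) / SSE(X1, X2)
r2_y1_2 = partial_r2(3.47, 113.42)     # SSR(X1 | X2) / SSE(X2)

print(round(r2_y2_1, 3), round(r2_y3_12, 3), round(r2_y1_2, 3))
```

Note that the denominator is always the error sum of squares of the model before the new variable enters, not SSTO.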


Comments 

1. The coefficients of partial determination can take on values between 0 and 1, as the definitions
readily indicate.

2. A coefficient of partial determination can be interpreted as a coefficient of simple determination.
Consider a multiple regression model with two X variables. Suppose we regress Y on $X_2$ and obtain
the residuals:

$$e_i(Y \mid X_2) = Y_i - \hat{Y}_i(X_2)$$

where $\hat{Y}_i(X_2)$ denotes the fitted values of Y when $X_2$ is in the model. Suppose we further regress $X_1$
on $X_2$ and obtain the residuals:

$$e_i(X_1 \mid X_2) = X_{i1} - \hat{X}_{i1}(X_2)$$

where $\hat{X}_{i1}(X_2)$ denotes the fitted values of $X_1$ in the regression of $X_1$ on $X_2$. The coefficient of simple
determination $R^2$ between these two sets of residuals equals the coefficient of partial determination
$R_{Y1|2}^2$. Thus, this coefficient measures the relation between Y and $X_1$ when both of these variables
have been adjusted for their linear relationships to $X_2$.

3. The plot of the residuals $e_i(Y \mid X_2)$ against $e_i(X_1 \mid X_2)$ provides a graphical representation of the
strength of the relationship between Y and $X_1$, adjusted for $X_2$. Such plots of residuals, called added
variable plots or partial regression plots, are discussed in Section 10.1.

Coefficients of Partial Correlation 

The square root of a coefficient of partial determination is called a coefficient of partial
correlation. It is given the same sign as that of the corresponding regression coefficient in the
fitted regression function. Coefficients of partial correlation are frequently used in practice,
although they do not have as clear a meaning as coefficients of partial determination. One
use of partial correlation coefficients is in computer routines for finding the best predictor
variable to be selected next for inclusion in the regression model. We discuss this use in
Chapter 9.





For the body fat example, we have:

$$r_{Y2|1} = \sqrt{.232} = .482$$
$$r_{Y3|12} = -\sqrt{.105} = -.324$$
$$r_{Y1|2} = \sqrt{.031} = .176$$

Note that the coefficients $r_{Y2|1}$ and $r_{Y1|2}$ are positive because we see from Table 7.2c that
$b_2 = .6594$ and $b_1 = .2224$ are positive. Similarly, $r_{Y3|12}$ is negative because we see from
Table 7.2d that $b_3 = -2.186$ is negative.
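The sign convention can be made explicit in code. The Python sketch below is an illustration only; `partial_corr` is a hypothetical helper, and the coefficients passed to it are the values quoted from Tables 7.2c and 7.2d.

```python
import math


def partial_corr(partial_r2, coefficient):
    """Square root of the partial R^2, carrying the sign of the fitted regression coefficient."""
    return math.copysign(math.sqrt(partial_r2), coefficient)


r_y2_1 = partial_corr(0.232, 0.6594)    # sign of b2 (Table 7.2c)
r_y3_12 = partial_corr(0.105, -2.186)   # sign of b3 (Table 7.2d)
r_y1_2 = partial_corr(0.031, 0.2224)    # sign of b1 (Table 7.2c)

print(round(r_y2_1, 3), round(r_y3_12, 3), round(r_y1_2, 3))
```

The coefficient used for the sign must come from the fitted model that contains exactly the variables named in the subscript.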


Comment 

Coefficients of partial determination can be expressed in terms of simple or other partial correlation
coefficients. For example:

$$R_{Y2|1}^2 = [r_{Y2|1}]^2 = \frac{(r_{Y2} - r_{12} r_{Y1})^2}{(1 - r_{12}^2)(1 - r_{Y1}^2)} \tag{7.41}$$

$$R_{Y2|13}^2 = [r_{Y2|13}]^2 = \frac{(r_{Y2|3} - r_{12|3} r_{Y1|3})^2}{(1 - r_{12|3}^2)(1 - r_{Y1|3}^2)} \tag{7.42}$$

where $r_{Y1}$ denotes the coefficient of simple correlation between Y and $X_1$, $r_{12}$ denotes the coefficient
of simple correlation between $X_1$ and $X_2$, and so on. Extensions are straightforward.


7.5 Standardized Multiple Regression Model

A standardized form of the general multiple regression model (6.7) is employed to control 
roundoff errors in normal equations calculations and to permit comparisons of the estimated 
regression coefficients in common units. 

Roundoff Errors in Normal Equations Calculations 

The results from normal equations calculations can be sensitive to rounding of data in 
intermediate stages of calculations. When the number of X variables is small — say, three 
or less — roundoff effects can be controlled by carrying a sufficient number of digits in 
intermediate calculations. Indeed, most computer regression programs use double-precision 
arithmetic in all computations to control roundoff effects. Still, with a large number of 
X variables, serious roundoff effects can arise despite the use of many digits in intermediate 
calculations. 

Roundoff errors tend to enter normal equations calculations primarily when the inverse
of X'X is taken. Of course, any errors in $(X'X)^{-1}$ may be magnified in calculating b and
other subsequent statistics. The danger of serious roundoff errors in $(X'X)^{-1}$ is particularly
great when (1) X'X has a determinant that is close to zero and/or (2) the elements of X'X
differ substantially in order of magnitude. The first condition arises when some or all of the
X variables are highly intercorrelated. We shall discuss this situation in Section 7.6.

The second condition arises when the X variables have substantially different magnitudes
so that the entries in the X'X matrix cover a wide range, say, from 15 to 49,000,000. A
solution for this condition is to transform the variables and thereby reparameterize the
regression model into the standardized regression model.





The transformation to obtain the standardized regression model, called the correlation
transformation, makes all entries in the X'X matrix for the transformed variables fall between
-1 and 1 inclusive, so that the calculation of the inverse matrix becomes much less subject
to roundoff errors due to dissimilar orders of magnitudes than with the original variables.


Comment 

In order to avoid the computational difficulties inherent in inverting the X'X matrix, many statistical
packages use an entirely different computational approach that involves decomposing the X matrix into
a product of several matrices with special properties. The X matrix is often first modified by centering
each of the variables (i.e., using the deviations around the mean) to further improve computational
accuracy. Information on decomposition strategies may be found in texts on statistical computing
such as Reference 7.1.


Lack of Comparability in Regression Coefficients 

A second difficulty with the nonstandardized multiple regression model (6.7) is that ordinarily
regression coefficients cannot be compared because of differences in the units involved.
We cite two examples.

1. When considering the fitted response function:

$$\hat{Y} = 200 + 20{,}000X_1 + .2X_2$$

one may be tempted to conclude that $X_1$ is the only important predictor variable, and that
$X_2$ has little effect on the response variable Y. A little reflection should make one wary of
this conclusion. The reason is that we do not know the units involved. Suppose the units are:

Y in dollars

$X_1$ in thousand dollars

$X_2$ in cents

In that event, the effect on the mean response of a $1,000 increase in $X_1$ (i.e., a 1-unit
increase) when $X_2$ is constant would be an increase of $20,000. This is exactly the same
as the effect of a $1,000 increase in $X_2$ (i.e., a 100,000-unit increase) when $X_1$ is constant,
despite the difference in the regression coefficients.

2. In the Dwaine Studios example of Figure 6.5, we cannot make any comparison between
$b_1$ and $b_2$ because $X_1$ is in units of thousand persons aged 16 or younger, whereas
$X_2$ is in units of thousand dollars of per capita disposable income.


Correlation Transformation 

Use of the correlation transformation helps with controlling roundoff errors and, by expressing
the regression coefficients in the same units, may be of help when these coefficients
are compared. We shall first describe the correlation transformation and then the resulting
standardized regression model.

The correlation transformation is a simple modification of the usual standardization of a
variable. Standardizing a variable, as in (A.37), involves centering and scaling the variable.
Centering involves taking the difference between each observation and the mean of all
observations for the variable; scaling involves expressing the centered observations in units
of the standard deviation of the observations for the variable. Thus, the usual standardizations
of the response variable Y and the predictor variables $X_1, \ldots, X_{p-1}$ are as follows:

$$\frac{Y_i - \bar{Y}}{s_Y} \tag{7.43a}$$

$$\frac{X_{ik} - \bar{X}_k}{s_k} \qquad (k = 1, \ldots, p - 1) \tag{7.43b}$$

where $\bar{Y}$ and $\bar{X}_k$ are the respective means of the Y and the $X_k$ observations, and $s_Y$ and $s_k$
are the respective standard deviations defined as follows:

$$s_Y = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n - 1}} \tag{7.43c}$$

$$s_k = \sqrt{\frac{\sum (X_{ik} - \bar{X}_k)^2}{n - 1}} \qquad (k = 1, \ldots, p - 1) \tag{7.43d}$$


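The correlation transformation is a simple function of the standardized variables in
(7.43a, b):

$$Y_i^* = \frac{1}{\sqrt{n - 1}} \left( \frac{Y_i - \bar{Y}}{s_Y} \right) \tag{7.44a}$$

$$X_{ik}^* = \frac{1}{\sqrt{n - 1}} \left( \frac{X_{ik} - \bar{X}_k}{s_k} \right) \qquad (k = 1, \ldots, p - 1) \tag{7.44b}$$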


Standardized Regression Model 

The regression model with the transformed variables $Y^*$ and $X_k^*$ as defined by the correlation
transformation in (7.44) is called a standardized regression model and is as follows:

$$Y_i^* = \beta_1^* X_{i1}^* + \cdots + \beta_{p-1}^* X_{i,p-1}^* + \varepsilon_i^* \tag{7.45}$$

The reason why there is no intercept parameter in the standardized regression model (7.45) is
that the least squares or maximum likelihood calculations always would lead to an estimated
intercept term of zero if an intercept parameter were present in the model.

It is easy to show that the parameters $\beta_1^*, \ldots, \beta_{p-1}^*$ in the standardized regression model
and the original parameters $\beta_0, \beta_1, \ldots, \beta_{p-1}$ in the ordinary multiple regression model (6.7)
are related as follows:

$$\beta_k = \left( \frac{s_Y}{s_k} \right) \beta_k^* \qquad (k = 1, \ldots, p - 1) \tag{7.46a}$$

$$\beta_0 = \bar{Y} - \beta_1 \bar{X}_1 - \cdots - \beta_{p-1} \bar{X}_{p-1} \tag{7.46b}$$

We see that the standardized regression coefficients $\beta_k^*$ and the original regression coefficients
$\beta_k$ $(k = 1, \ldots, p - 1)$ are related by simple scaling factors involving ratios of standard
deviations.





X'X Matrix for Transformed Variables 


In order to be able to study the special nature of the X'X matrix and the least squares normal
equations when the variables have been transformed by the correlation transformation, we
need to decompose the correlation matrix in (6.67) containing all pairwise correlation coefficients
among the response and predictor variables $Y, X_1, X_2, \ldots, X_{p-1}$ into two matrices.

1. The first matrix, denoted by $r_{XX}$, is called the correlation matrix of the X variables. It
has as its elements the coefficients of simple correlation between all pairs of the X variables.
This matrix is defined as follows:

$$\underset{(p-1) \times (p-1)}{r_{XX}} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1,p-1} \\ r_{21} & 1 & \cdots & r_{2,p-1} \\ \vdots & \vdots & & \vdots \\ r_{p-1,1} & r_{p-1,2} & \cdots & 1 \end{bmatrix} \tag{7.47}$$

Here, $r_{12}$ again denotes the coefficient of simple correlation between $X_1$ and $X_2$, and so
on. Note that the main diagonal consists of 1s because the coefficient of simple correlation
between a variable and itself is 1. The correlation matrix $r_{XX}$ is symmetric; remember that
$r_{kk'} = r_{k'k}$. Because of the symmetry of this matrix, computer printouts frequently omit the
lower or upper triangular block of elements.

2. The second matrix, denoted by $r_{YX}$, is a vector containing the coefficients of simple
correlation between the response variable Y and each of the X variables, denoted again by
$r_{Y1}$, $r_{Y2}$, etc.:

$$\underset{(p-1) \times 1}{r_{YX}} = \begin{bmatrix} r_{Y1} \\ r_{Y2} \\ \vdots \\ r_{Y,p-1} \end{bmatrix} \tag{7.48}$$


Now we are ready to consider the X'X matrix for the transformed variables in the
standardized regression model (7.45). The X matrix here is:

$$\underset{n \times (p-1)}{X} = \begin{bmatrix} X_{11}^* & \cdots & X_{1,p-1}^* \\ X_{21}^* & \cdots & X_{2,p-1}^* \\ \vdots & & \vdots \\ X_{n1}^* & \cdots & X_{n,p-1}^* \end{bmatrix} \tag{7.49}$$

Remember that the standardized regression model (7.45) does not contain an intercept term;
hence, there is no column of 1s in the X matrix. It can be shown that the X'X matrix for the
transformed variables is simply the correlation matrix of the X variables defined in (7.47):

$$\underset{(p-1) \times (p-1)}{X'X} = r_{XX} \tag{7.50}$$

Since the X'X matrix for the transformed variables consists of coefficients of correlation
between the X variables, all of its elements are between -1 and 1 and thus are of the
same order of magnitude. As we pointed out earlier, this can be of great help in controlling
roundoff errors when inverting the X'X matrix.
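Result (7.50) can be illustrated numerically. The Python sketch below applies the correlation transformation (7.44b) to two made-up predictors on very different scales and forms X'X for the transformed columns; the data and the helper names are assumptions introduced only for this illustration.

```python
import math


def correlation_transform(values):
    """Apply (7.44): center, scale by the standard deviation, and divide by sqrt(n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / (math.sqrt(n - 1) * sd) for v in values]


def pearson_r(u, v):
    """Coefficient of simple correlation between two variables."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


# Made-up predictors whose raw cross products would differ by many orders of magnitude.
x1 = [15.0, 22.0, 18.0, 30.0, 27.0]
x2 = [4.1e6, 3.9e6, 5.2e6, 6.0e6, 4.4e6]

cols = [correlation_transform(x1), correlation_transform(x2)]

# X'X for the transformed variables (no column of 1s).
xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
r12 = pearson_r(x1, x2)
```

The diagonal entries of `xtx` equal 1 and the off-diagonal entries equal `r12`, so every element lies between -1 and 1 regardless of the original units.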





Comment 

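We illustrate that the X'X matrix for the transformed variables is the correlation matrix of the X
variables by considering two entries in the matrix:

1. In the upper left corner of X'X we have:

$$\sum (X_{i1}^*)^2 = \sum \left( \frac{X_{i1} - \bar{X}_1}{\sqrt{n - 1}\, s_1} \right)^2 = \frac{\sum (X_{i1} - \bar{X}_1)^2}{n - 1} \div s_1^2 = 1$$

2. In the first row, second column of X'X, we have:

$$\sum X_{i1}^* X_{i2}^* = \sum \left( \frac{X_{i1} - \bar{X}_1}{\sqrt{n - 1}\, s_1} \right) \left( \frac{X_{i2} - \bar{X}_2}{\sqrt{n - 1}\, s_2} \right)$$

$$= \frac{1}{n - 1} \frac{\sum (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)}{s_1 s_2}$$

$$= \frac{\sum (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)}{\left[ \sum (X_{i1} - \bar{X}_1)^2 \sum (X_{i2} - \bar{X}_2)^2 \right]^{1/2}}$$

But this equals $r_{12}$, the coefficient of correlation between $X_1$ and $X_2$, by (2.84).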


Estimated Standardized Regression Coefficients 

The least squares normal equations (6.24) for the ordinary multiple regression model:

$$X'Xb = X'Y$$

and the least squares estimators (6.25):

$$b = (X'X)^{-1} X'Y$$

can be expressed simply for the transformed variables. It can be shown that for the transformed
variables, X'Y becomes:

$$\underset{(p-1) \times 1}{X'Y} = r_{YX} \tag{7.51}$$

where $r_{YX}$ is defined in (7.48) as the vector of the coefficients of simple correlation between
Y and each X variable. It now follows from (7.50) and (7.51) that the least squares normal
equations and estimators of the regression coefficients of the standardized regression
model (7.45) are as follows:

$$r_{XX} b = r_{YX} \tag{7.52a}$$

$$b = r_{XX}^{-1} r_{YX} \tag{7.52b}$$

where:

$$\underset{(p-1) \times 1}{b} = \begin{bmatrix} b_1^* \\ b_2^* \\ \vdots \\ b_{p-1}^* \end{bmatrix} \tag{7.52c}$$

The regression coefficients $b_1^*, \ldots, b_{p-1}^*$ are often called standardized regression
coefficients.





The return to the estimated regression coefficients for regression model (6.7) in the
original variables is accomplished by employing the relations:

$$b_k = \left( \frac{s_Y}{s_k} \right) b_k^* \qquad (k = 1, \ldots, p - 1) \tag{7.53a}$$

$$b_0 = \bar{Y} - b_1 \bar{X}_1 - \cdots - b_{p-1} \bar{X}_{p-1} \tag{7.53b}$$


Comment 

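When there are two X variables in the regression model, i.e., when $p - 1 = 2$, we can readily see the
algebraic form of the standardized regression coefficients. We have:

$$r_{XX} = \begin{bmatrix} 1 & r_{12} \\ r_{12} & 1 \end{bmatrix} \tag{7.54a}$$

$$r_{YX} = \begin{bmatrix} r_{Y1} \\ r_{Y2} \end{bmatrix} \tag{7.54b}$$

$$r_{XX}^{-1} = \frac{1}{1 - r_{12}^2} \begin{bmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{bmatrix} \tag{7.54c}$$

Hence, by (7.52b) we obtain:

$$b = \frac{1}{1 - r_{12}^2} \begin{bmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{bmatrix} \begin{bmatrix} r_{Y1} \\ r_{Y2} \end{bmatrix} = \frac{1}{1 - r_{12}^2} \begin{bmatrix} r_{Y1} - r_{12} r_{Y2} \\ r_{Y2} - r_{12} r_{Y1} \end{bmatrix} \tag{7.55}$$

Thus:

$$b_1^* = \frac{r_{Y1} - r_{12} r_{Y2}}{1 - r_{12}^2} \tag{7.55a}$$

$$b_2^* = \frac{r_{Y2} - r_{12} r_{Y1}}{1 - r_{12}^2} \tag{7.55b}$$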
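The closed forms (7.55a) and (7.55b) can be checked against the normal equations (7.52a). The Python sketch below does so for a set of hypothetical correlations chosen only for illustration; the function names are ours, and `original_slope` simply applies the back-transformation (7.53a).

```python
def standardized_coefficients(r_y1, r_y2, r_12):
    """Closed-form b1* and b2* from (7.55a) and (7.55b) for two predictors."""
    denom = 1 - r_12 ** 2
    return (r_y1 - r_12 * r_y2) / denom, (r_y2 - r_12 * r_y1) / denom


def original_slope(b_star, s_y, s_k):
    """Back-transformation (7.53a): b_k = (s_Y / s_k) * b_k*."""
    return (s_y / s_k) * b_star


# Hypothetical correlations, not taken from any example in the text.
r_y1, r_y2, r_12 = 0.8, 0.6, 0.5
b1_star, b2_star = standardized_coefficients(r_y1, r_y2, r_12)

# Left-hand sides of the normal equations r_XX b = r_YX.
lhs_first = b1_star + r_12 * b2_star
lhs_second = r_12 * b1_star + b2_star
```

Both left-hand sides reproduce $r_{Y1}$ and $r_{Y2}$, which confirms that the closed forms solve (7.52a). The formulas break down as $r_{12}^2$ approaches 1, the case of highly intercorrelated predictors.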


Example 


Table 7.5a repeats a portion of the original data for the Dwaine Studios example in Figure 6.5b, and Table 7.5b contains the data transformed according to the correlation transformation (7.44). We illustrate the calculation of the transformed data for the first case, using the means and standard deviations in Table 7.5a (differences in the last digit of the transformed data are due to rounding effects):


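$$Y_1^* = \frac{1}{\sqrt{n-1}}\left(\frac{Y_1 - \bar{Y}}{s_Y}\right) = \frac{1}{\sqrt{21-1}}\left(\frac{174.4 - 181.90}{36.191}\right) = -.04634$$

$$X_{11}^* = \frac{1}{\sqrt{n-1}}\left(\frac{X_{11} - \bar{X}_1}{s_1}\right) = \frac{1}{\sqrt{21-1}}\left(\frac{68.5 - 62.019}{18.620}\right) = .07783$$

$$X_{12}^* = \frac{1}{\sqrt{n-1}}\left(\frac{X_{12} - \bar{X}_2}{s_2}\right) = \frac{1}{\sqrt{21-1}}\left(\frac{16.7 - 17.143}{.97035}\right) = -.10208$$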
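The first-case calculation can be reproduced with a few lines of code. This is a minimal sketch, assuming the means and standard deviations quoted from Table 7.5a; the helper name `correlation_transform` is hypothetical.

```python
import math


def correlation_transform(value, mean, std_dev, n):
    """Correlation transformation (7.44): standardize, then divide by sqrt(n - 1)."""
    return (value - mean) / (std_dev * math.sqrt(n - 1))


n = 21  # number of cases in the Dwaine Studios data
y1_star = correlation_transform(174.4, 181.90, 36.191, n)
x11_star = correlation_transform(68.5, 62.019, 18.620, n)
x12_star = correlation_transform(16.7, 17.143, 0.97035, n)

# Compare with the hand calculation in the text: -.04634, .07783, -.10208
print(y1_star, x11_star, x12_star)
```

The printed values should agree with the hand calculation to the digits shown there; Table 7.5b differs in the last digit because it was computed from unrounded means and standard deviations.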





TABLE 7.5 Correlation Transformation and Fitted Standardized Regression Model - Dwaine Studios Example.

(a) Original Data

Case    Sales      Target Population    Per Capita Disposable Income
  i     $Y_i$         $X_{i1}$                  $X_{i2}$
  1     174.4          68.5                      16.7
  2     164.4          45.2                      16.8
 ...     ...            ...                       ...
 20     224.1          82.7                      19.1
 21     166.5          52.3                      16.0

        $\bar{Y} = 181.90$    $\bar{X}_1 = 62.019$    $\bar{X}_2 = 17.143$
        $s_Y = 36.191$        $s_1 = 18.620$          $s_2 = .97035$

(b) Transformed Data

  i     $Y_i^*$      $X_{i1}^*$     $X_{i2}^*$
  1     -.04637       .07783        -.10205
  2     -.10815      -.20198        -.07901
 ...      ...           ...           ...
 20      .26070       .24835         .45100
 21     -.09518      -.11671        -.26336

(c) Fitted Standardized Model

$\hat{Y}^* = .7484X_1^* + .2511X_2^*$


When fitting the standardized regression model (7.45) to the transformed data, we obtain 
the fitted model in Table 7.5c: 


$$\hat{Y}^* = .7484X_1^* + .2511X_2^*$$


The standardized regression coefficients $b_1^* = .7484$ and $b_2^* = .2511$ are shown in the SYSTAT regression output in Figure 6.5a on page 237, labeled STD COEF. We see from the standardized regression coefficients that an increase of one standard deviation of $X_1$ (target population) when $X_2$ (per capita disposable income) is fixed leads to a much larger increase in expected sales (in units of standard deviations of $Y$) than does an increase of one standard deviation of $X_2$ when $X_1$ is fixed.

To shift from the standardized regression coefficients $b_1^*$ and $b_2^*$ back to the regression coefficients for the model with the original variables, we employ (7.53). Using the data in Table 7.5, we obtain:

$$b_1 = \left(\frac{s_Y}{s_1}\right) b_1^* = \frac{36.191}{18.620}(.7484) = 1.4546$$

$$b_2 = \left(\frac{s_Y}{s_2}\right) b_2^* = \frac{36.191}{.97035}(.2511) = 9.3652$$

$$b_0 = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 = 181.90 - 1.4546(62.019) - 9.3652(17.143) = -68.860$$
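The back-transformation in (7.53) is simple enough to script. The sketch below assumes the summary statistics and standardized coefficients quoted in this example; the function name `to_original_scale` is hypothetical.

```python
def to_original_scale(b_star, s_y, s_x):
    """Relation (7.53a): b_k = (s_Y / s_k) * b_k^*."""
    return (s_y / s_x) * b_star


# Summary statistics from Table 7.5a
s_y, s_1, s_2 = 36.191, 18.620, 0.97035
y_bar, x1_bar, x2_bar = 181.90, 62.019, 17.143

b1 = to_original_scale(0.7484, s_y, s_1)
b2 = to_original_scale(0.2511, s_y, s_2)
b0 = y_bar - b1 * x1_bar - b2 * x2_bar  # relation (7.53b)

print(b1, b2, b0)
```

Because the code carries unrounded values of $b_1$ and $b_2$ into the intercept, its $b_0$ can differ slightly in the later decimals from the hand calculation, which uses the rounded coefficients.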





The estimated regression function for the multiple regression model in the original variables 
therefore is: 

$$\hat{Y} = -68.860 + 1.455X_1 + 9.365X_2$$

This is the same fitted regression function we obtained in Chapter 6, except for slight rounding effect differences. Here, $b_1$ and $b_2$ cannot be compared directly because $X_1$ is in units of thousands of persons and $X_2$ is in units of thousands of dollars.

Sometimes the standardized regression coefficients $b_1^* = .7484$ and $b_2^* = .2511$ are interpreted as showing that target population ($X_1$) has a much greater impact on sales than per capita disposable income ($X_2$) because $b_1^*$ is much larger than $b_2^*$. However, as we will see in the next section, one must be cautious about interpreting any regression coefficient, whether standardized or not. The reason is that when the predictor variables are correlated among themselves, as here, the regression coefficients are affected by the other predictor variables in the model. For the Dwaine Studios data, the correlation between $X_1$ and $X_2$ is $r_{12} = .781$, as shown in the correlation matrix in Figure 6.4b on page 232.

The magnitudes of the standardized regression coefficients are affected not only by 
the presence of correlations among the predictor variables but also by the spacings of the 
observations on each of these variables. Sometimes these spacings may be quite arbitrary. 
Hence, it is ordinarily not wise to interpret the magnitudes of standardized regression 
coefficients as reflecting the comparative importance of the predictor variables. 

Comments 

1. Some computer packages present both the regression coefficients $b_k$ for the model in the original variables as well as the standardized coefficients $b_k^*$, as in the SYSTAT output in Figure 6.5a. The standardized coefficients are sometimes labeled beta coefficients in printouts.

2. Some computer printouts show the magnitude of the determinant of the correlation matrix of the $X$ variables. A near-zero value for this determinant implies both a high degree of linear association among the $X$ variables and a high potential for roundoff errors. For two $X$ variables, this determinant is seen from (7.54) to be $1 - r_{12}^2$, which approaches 0 as $r_{12}^2$ approaches 1.

3. It is possible to use the correlation transformation with a computer package that does not permit regression through the origin, because the intercept coefficient will always be zero for data so transformed. The other regression coefficients will also be correct.

4. Use of the standardized variables (7.43) without the correlation transformation modification in (7.44) will lead to the same standardized regression coefficients as those in (7.52b) for the correlation-transformed variables. However, the elements of the $\mathbf{X}'\mathbf{X}$ matrix will not then be bounded between $-1$ and 1. ■

7.6 Multicollinearity and Its Effects

In multiple regression analysis, the nature and significance of the relations between the predictor or explanatory variables and the response variable are often of particular interest. Some questions frequently asked are:

1. What is the relative importance of the effects of the different predictor variables? 

2. What is the magnitude of the effect of a given predictor variable on the response variable? 

3. Can any predictor variable be dropped from the model because it has little or no effect 
on the response variable? 





4. Should any predictor variables not yet included in the model be considered for possible inclusion?

If the predictor variables included in the model are (1) uncorrelated among themselves 
and (2) uncorrelated with any other predictor variables that are related to the response 
variable but are omitted from the model, relatively simple answers can be given to these 
questions. Unfortunately, in many nonexperimental situations in business, economics, and 
the social and biological sciences, the predictor or explanatory variables tend to be correlated
among themselves and with other variables that are related to the response variable but are 
not included in the model. For example, in a regression of family food expenditures on 
the explanatory variables family income, family savings, and age of head of household, 
the explanatory variables will be correlated among themselves. Further, they will also be 
correlated with other socioeconomic variables not included in the model that do affect 
family food expenditures, such as family size. 

When the predictor variables are correlated among themselves, intercorrelation or multicollinearity among them is said to exist. (Sometimes the latter term is reserved for those instances when the correlation among the predictor variables is very high.) We shall explore a variety of interrelated problems created by multicollinearity among the predictor variables. First, however, we examine the situation when the predictor variables are not correlated.


Uncorrelated Predictor Variables 

Table 7.6 contains data for a small-scale experiment on the effect of work crew size ($X_1$) and level of bonus pay ($X_2$) on crew productivity ($Y$). The predictor variables $X_1$ and $X_2$ are uncorrelated here, i.e., $r_{12}^2 = 0$, where $r_{12}^2$ denotes the coefficient of simple determination between $X_1$ and $X_2$. Table 7.7a contains the fitted regression function and the analysis of variance table when both $X_1$ and $X_2$ are included in the model. Table 7.7b contains the same information when only $X_1$ is included in the model, and Table 7.7c contains this information when only $X_2$ is in the model.

An important feature to note in Table 7.7 is that the regression coefficient for $X_1$, $b_1 = 5.375$, is the same whether only $X_1$ is included in the model or both predictor variables are included. The same holds for $b_2 = 9.250$. This is the result of the two predictor variables being uncorrelated.


TABLE 7.6 Uncorrelated Predictor Variables - Work Crew Productivity Example.

Case    Crew Size    Bonus Pay (dollars)    Crew Productivity
  i     $X_{i1}$          $X_{i2}$                $Y_i$
  1        4                 2                     42
  2        4                 2                     39
  3        4                 3                     48





TABLE 7.7 Regression Results when Predictor Variables Are Uncorrelated - Work Crew Productivity Example.

(a) Regression of $Y$ on $X_1$ and $X_2$
$\hat{Y} = .375 + 5.375X_1 + 9.250X_2$

Source of Variation       SS        df       MS
Regression             402.250       2     201.125
Error                   17.625       5       3.525
Total                  419.875       7

(b) Regression of $Y$ on $X_1$
$\hat{Y} = 23.500 + 5.375X_1$

Source of Variation       SS        df       MS
Regression             231.125       1     231.125
Error                  188.750       6      31.458
Total                  419.875       7

(c) Regression of $Y$ on $X_2$
$\hat{Y} = 27.250 + 9.250X_2$

Source of Variation       SS        df       MS
Regression             171.125       1     171.125
Error                  248.750       6      41.458
Total                  419.875       7


Thus, when the predictor variables are uncorrelated, the effects ascribed to them by a 
first-order regression model are the same no matter which other of these predictor variables 
are included in the model. This is a strong argument for controlled experiments whenever 
possible, since experimental control permits choosing the levels of the predictor variables 
so as to make these variables uncorrelated. 
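The invariance of the coefficients can be checked numerically. The sketch below is illustrative only: the eight cases are an assumption (a balanced two-by-two layout, of which only cases 1 to 3 appear in the Table 7.6 excerpt), chosen because they reproduce the coefficients reported in Table 7.7. The helper names are hypothetical.

```python
def centered_cross_product(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def slope_alone(x, y):
    """Least squares slope for the simple linear regression of y on x."""
    return centered_cross_product(x, y) / centered_cross_product(x, x)


def slopes_together(x1, x2, y):
    """Least squares slopes (b1, b2) for the first-order model with x1 and x2."""
    s11 = centered_cross_product(x1, x1)
    s22 = centered_cross_product(x2, x2)
    s12 = centered_cross_product(x1, x2)
    s1y = centered_cross_product(x1, y)
    s2y = centered_cross_product(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Assumed data: cases beyond the first three are not shown in the excerpt.
crew_size = [4, 4, 4, 4, 6, 6, 6, 6]
bonus_pay = [2, 2, 3, 3, 2, 2, 3, 3]
productivity = [42, 39, 48, 51, 49, 53, 61, 60]

s12 = centered_cross_product(crew_size, bonus_pay)  # zero: predictors uncorrelated
b1_both, b2_both = slopes_together(crew_size, bonus_pay, productivity)
b1_alone = slope_alone(crew_size, productivity)
b2_alone = slope_alone(bonus_pay, productivity)
print(s12, b1_both, b1_alone, b2_both, b2_alone)
```

Because the cross product between the two predictors is zero, the two-predictor formulas collapse to the one-predictor formulas, which is why the slopes coincide.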

Another important feature of Table 7.7 is related to the error sums of squares. Note from Table 7.7 that the extra sum of squares $SSR(X_1|X_2)$ equals the regression sum of squares $SSR(X_1)$ when only $X_1$ is in the regression model:

$$SSR(X_1|X_2) = SSE(X_2) - SSE(X_1, X_2) = 248.750 - 17.625 = 231.125$$

$$SSR(X_1) = 231.125$$

Similarly, the extra sum of squares $SSR(X_2|X_1)$ equals $SSR(X_2)$, the regression sum of squares when only $X_2$ is in the regression model:

$$SSR(X_2|X_1) = SSE(X_1) - SSE(X_1, X_2) = 188.750 - 17.625 = 171.125$$

$$SSR(X_2) = 171.125$$





In general, when two or more predictor variables are uncorrelated, the marginal contribution of one predictor variable in reducing the error sum of squares when the other predictor variables are in the model is exactly the same as when this predictor variable is in the model alone.
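The sums of squares reported in Table 7.7 can be used to confirm this bookkeeping. The following sketch uses only the tabled values; the variable names are illustrative.

```python
# Error and regression sums of squares as reported in Table 7.7
sse_x1_x2 = 17.625   # both predictors in the model
sse_x1 = 188.750     # only X1 in the model
sse_x2 = 248.750     # only X2 in the model
ssr_x1 = 231.125
ssr_x2 = 171.125
ssr_x1_x2 = 402.250

# Extra sums of squares: reduction in SSE from adding the other predictor
ssr_x1_given_x2 = sse_x2 - sse_x1_x2
ssr_x2_given_x1 = sse_x1 - sse_x1_x2

print(ssr_x1_given_x2, ssr_x1)       # marginal contribution of X1, with and without X2
print(ssr_x2_given_x1, ssr_x2)       # marginal contribution of X2, with and without X1
print(ssr_x1 + ssr_x2, ssr_x1_x2)    # the one-predictor SSRs add up to the joint SSR
```

The last line shows a further consequence for uncorrelated predictors: the regression sum of squares for the full model splits cleanly into the two single-predictor pieces.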


Comment 

To show that the regression coefficient of $X_1$ is unchanged when $X_2$ is added to the regression model in the case where $X_1$ and $X_2$ are uncorrelated, consider the following algebraic expression for $b_1$ in the first-order multiple regression model with two predictor variables:

$$b_1 = \frac{\dfrac{\sum (X_{i1} - \bar{X}_1)(Y_i - \bar{Y})}{\sum (X_{i1} - \bar{X}_1)^2} - \left[\dfrac{\sum (Y_i - \bar{Y})^2}{\sum (X_{i1} - \bar{X}_1)^2}\right]^{1/2} r_{Y2}\, r_{12}}{1 - r_{12}^2} \tag{7.56}$$

where, as before, $r_{Y2}$ denotes the coefficient of simple correlation between $Y$ and $X_2$, and $r_{12}$ denotes the coefficient of simple correlation between $X_1$ and $X_2$.

If $X_1$ and $X_2$ are uncorrelated, $r_{12} = 0$, and (7.56) reduces to:

$$b_1 = \frac{\sum (X_{i1} - \bar{X}_1)(Y_i - \bar{Y})}{\sum (X_{i1} - \bar{X}_1)^2} \qquad \text{when } r_{12} = 0 \tag{7.56a}$$

But (7.56a) is the estimator of the slope for the simple linear regression of $Y$ on $X_1$, per (1.10a).

Hence, when $X_1$ and $X_2$ are uncorrelated, adding $X_2$ to the regression model does not change the regression coefficient for $X_1$; correspondingly, adding $X_1$ to the regression model does not change the regression coefficient for $X_2$. ■
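The step from the standardized coefficients to (7.56) can be filled in as follows. This is a sketch that assumes only (7.53a), (7.55a), and the usual definitions of $s_Y$, $s_1$, and $r_{Y1}$:

$$b_1 = \frac{s_Y}{s_1}\, b_1^* = \frac{s_Y}{s_1} \cdot \frac{r_{Y1} - r_{12}\, r_{Y2}}{1 - r_{12}^2}$$

$$\frac{s_Y}{s_1} = \left[\frac{\sum (Y_i - \bar{Y})^2}{\sum (X_{i1} - \bar{X}_1)^2}\right]^{1/2}, \qquad \frac{s_Y}{s_1}\, r_{Y1} = \frac{\sum (X_{i1} - \bar{X}_1)(Y_i - \bar{Y})}{\sum (X_{i1} - \bar{X}_1)^2}$$

Substituting the two identities in the second line into the first gives (7.56). Setting $r_{12} = 0$ removes the second term of the numerator and makes the denominator equal to 1, which leaves (7.56a).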


Nature of Problem when Predictor Variables Are Perfectly Correlated 

To see the essential nature of the problem of multicollinearity, we shall employ a simple example where the two predictor variables are perfectly correlated. The data in Table 7.8 refer to four sample observations on a response variable and two predictor variables. Mr. A was asked to fit the first-order multiple regression function:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{7.57}$$


TABLE 7.8 Example of Perfectly Correlated Predictor Variables.






FIGURE 7.2 Two Response Planes That Intersect when $X_2 = 5 + .5X_1$.



He returned in a short time with the fitted response function: 

$$\hat{Y} = -87 + X_1 + 18X_2 \tag{7.58}$$

He was proud because the response function fits the data perfectly. The fitted values are 
shown in Table 7.8. 

It so happened that Ms. B also was asked to fit the response function (7.57) to the same 
data, and she proudly obtained: 

$$\hat{Y} = -7 + 9X_1 + 2X_2 \tag{7.59}$$

Her response function also fits the data perfectly, as shown in Table 7.8. 

Indeed, it can be shown that infinitely many response functions will fit the data in Table 7.8 perfectly. The reason is that the predictor variables $X_1$ and $X_2$ are perfectly related, according to the relation:

$$X_2 = 5 + .5X_1 \tag{7.60}$$

Note that the fitted response functions (7.58) and (7.59) are entirely different response surfaces, as may be seen in Figure 7.2. The two response surfaces have the same fitted values only when they intersect. This occurs when $X_1$ and $X_2$ follow relation (7.60), i.e., when $X_2 = 5 + .5X_1$.

Thus, when $X_1$ and $X_2$ are perfectly related and, as in our example, the data do not contain any random error component, many different response functions will lead to the same perfectly fitted values for the observations and to the same fitted values for any other ($X_1$, $X_2$) combinations following the relation between $X_1$ and $X_2$. Yet these response functions are not the same and will lead to different fitted values for ($X_1$, $X_2$) combinations that do not follow the relation between $X_1$ and $X_2$.
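The agreement on the line and disagreement off it is easy to verify directly. In the sketch below, the two functions encode (7.58) and (7.59); the $X_1$ values and the off-line point are hypothetical choices made only for illustration.

```python
def response_a(x1, x2):
    """Fitted response function (7.58)."""
    return -87 + x1 + 18 * x2


def response_b(x1, x2):
    """Fitted response function (7.59)."""
    return -7 + 9 * x1 + 2 * x2


def x2_on_line(x1):
    """X2 value implied by relation (7.60)."""
    return 5 + 0.5 * x1


# Hypothetical X1 values; X2 is forced to follow relation (7.60).
on_line = [(x1, x2_on_line(x1)) for x1 in (0, 2, 6, 10)]
on_line_fits = [(response_a(x1, x2), response_b(x1, x2)) for x1, x2 in on_line]
print(on_line_fits)  # each pair holds two identical fitted values

# A hypothetical point that does not follow relation (7.60).
off_line_fits = (response_a(2, 10), response_b(2, 10))
print(off_line_fits)  # the two functions now disagree
```

Subtracting (7.59) from (7.58) gives $16(X_2 - 5 - .5X_1)$, so the two surfaces coincide exactly where relation (7.60) holds and nowhere else.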

Two key implications of this example are: 

1. The perfect relation between $X_1$ and $X_2$ did not inhibit our ability to obtain a good fit to the data.





2. Since many different response functions provide the same good fit, we cannot interpret any one set of regression coefficients as reflecting the effects of the different predictor variables. Thus, in response function (7.58), $b_1 = 1$ and $b_2 = 18$ do not imply that $X_2$ is the key predictor variable and $X_1$ plays little role, because response function (7.59) provides an equally good fit and its regression coefficients have opposite comparative magnitudes.

Effects of Multicollinearity 

In practice, we seldom find predictor variables that are perfectly related or data that do not 
contain some random error component. Nevertheless, the implications just noted for our 
idealized example still have relevance.

1. The fact that some or all predictor variables are correlated among themselves does 
not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences 
about mean responses or predictions of new observations, provided these inferences are 
made within the region of observations. (Figure 6.3 on p. 231 illustrates the concept of the 
region of observations for the case of two predictor variables.) 

2. The counterpart in real life to the many different regression functions providing equally 
good fits to the data in our idealized example is that the estimated regression coefficients tend 
to have large sampling variability when the predictor variables are highly correlated. Thus, 
the estimated regression coefficients tend to vary widely from one sample to the next when 
the predictor variables are highly correlated. As a result, only imprecise information may 
be available about the individual true regression coefficients. Indeed, many of the estimated 
regression coefficients individually may be statistically not significant even though a definite 
statistical relation exists between the response variable and the set of predictor variables. 

3. The common interpretation of a regression coefficient as measuring the change in the 
expected value of the response variable when the given predictor variable is increased by 
one unit while all other predictor variables are held constant is not fully applicable when 
multicollinearity exists. It may be conceptually feasible to think of varying one predictor 
variable and holding the others constant, but it may not be possible in practice to do so 
for predictor variables that are highly correlated. For example, in a regression model for 
predicting crop yield from amount of rainfall and hours of sunshine, the relation between the 
two predictor variables makes it unrealistic to consider varying one while holding the other
constant. Therefore, the simple interpretation of the regression coefficients as measuring 
marginal effects is often unwarranted with highly correlated predictor variables. 

We illustrate these effects of multicollinearity by returning to the body fat example. A portion of the basic data was given in Table 7.1, and regression results for different fitted models were presented in Table 7.2. Figure 7.3 contains the scatter plot matrix and the correlation matrix of the predictor variables. It is evident from the scatter plot matrix that predictor variables $X_1$ and $X_2$ are highly correlated; the correlation matrix of the $X$ variables shows that the coefficient of simple correlation is $r_{12} = .924$. On the other hand, $X_3$ is not so highly related to $X_1$ and $X_2$ individually; the correlation matrix shows that the correlation coefficients are $r_{13} = .458$ and $r_{23} = .085$. (But $X_3$ is highly correlated with $X_1$ and $X_2$ together; the coefficient of multiple determination when $X_3$ is regressed on $X_1$ and $X_2$ is .998.)





FIGURE 7.3 Scatter Plot Matrix and Correlation Matrix of the Predictor Variables - Body Fat Example.

(a) Scatter Plot Matrix of X Variables (panels for $X_1$, $X_2$, and $X_3$)

(b) Correlation Matrix of X Variables

$$\mathbf{r}_{XX} = \begin{bmatrix} 1.0 & .924 & .458 \\ .924 & 1.0 & .085 \\ .458 & .085 & 1.0 \end{bmatrix}$$


Effects on Regression Coefficients. Note from Table 7.2 that the regression coefficient for $X_1$, triceps skinfold thickness, varies markedly depending on which other variables are included in the model:


Variables in Model        $b_1$       $b_2$
$X_1$                     .8572        --
$X_2$                      --         .8565
$X_1, X_2$                .2224       .6594
$X_1, X_2, X_3$          4.334      -2.857


The story is the same for the regression coefficient for $X_2$. Indeed, the regression coefficient $b_2$ even changes sign when $X_3$ is added to the model that includes $X_1$ and $X_2$.

The important conclusion we must draw is: When predictor variables are correlated, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Thus, a regression coefficient does not reflect any inherent effect of the particular predictor variable on the response variable but only a marginal or partial effect, given whatever other correlated predictor variables are included in the model.

Comment 

Another illustration of how intercorrelated predictor variables that are omitted from the regression model can influence the regression coefficients in the regression model is provided by an analyst who was perplexed about the sign of a regression coefficient in the fitted regression model. The analyst had found in a regression of territory company sales on territory population size, per capita income, and some other predictor variables that the regression coefficient for population size was negative, and this conclusion was supported by a confidence interval for the regression coefficient. A consultant noted that the analyst did not include the major competitor's market penetration as a predictor variable in the model. The competitor was most active and effective in territories with large populations, thereby





keeping company sales down in these territories. The result of the omission of this predictor variable 
from the model was a negative coefficient for the population size variable. ■ 

Effects on Extra Sums of Squares. When predictor variables are correlated, the marginal 
contribution of any one predictor variable in reducing the error sum of squares varies, 
depending on which other variables are already in the regression model, just as for regression 
coefficients. For example, Table 7.2 provides the following extra sums of squares for $X_1$:

$SSR(X_1) = 352.27$
$SSR(X_1|X_2) = 3.47$

The reason why $SSR(X_1|X_2)$ is so small compared with $SSR(X_1)$ is that $X_1$ and $X_2$ are highly correlated with each other and with the response variable. Thus, when $X_2$ is already in the regression model, the marginal contribution of $X_1$ in reducing the error sum of squares is comparatively small because $X_2$ contains much of the same information as $X_1$.

The same story is found in Table 7.2 for $X_2$. Here $SSR(X_2|X_1) = 33.17$, which is much smaller than $SSR(X_2) = 381.97$. The important conclusion is this: When predictor variables
are correlated, there is no unique sum of squares that can be ascribed to any one predictor 
variable as reflecting its effect in reducing the total variation in Y. The reduction in the 
total variation ascribed to a predictor variable must be viewed in the context of the other 
correlated predictor variables already included in the model. 
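The four sums of squares quoted above are mutually consistent, which can be seen by building the regression sum of squares for the two-predictor model in both orders. The sketch below uses only values quoted in this section (with SSTO = 495.39 for the body fat data); the variable names are illustrative.

```python
ssto = 495.39  # total sum of squares for the body fat example

ssr_x1, ssr_x1_given_x2 = 352.27, 3.47
ssr_x2, ssr_x2_given_x1 = 381.97, 33.17

# Two orders of entry must give the same SSR(X1, X2)
ssr_x1_then_x2 = ssr_x1 + ssr_x2_given_x1
ssr_x2_then_x1 = ssr_x2 + ssr_x1_given_x2

# Share of the total variation credited to X1, depending on order of entry
share_x1_first = ssr_x1 / ssto
share_x1_second = ssr_x1_given_x2 / ssto

print(ssr_x1_then_x2, ssr_x2_then_x1)
print(share_x1_first, share_x1_second)
```

The total explained by the pair is fixed, but how it is split between $X_1$ and $X_2$ depends entirely on which variable is entered first.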


Comments 

1. Multicollinearity also affects the coefficients of partial determination through its effects on the extra sums of squares. Note from Table 7.2 for the body fat example, for instance, that $X_1$ is highly correlated with $Y$:

$$R_{Y1}^2 = \frac{SSR(X_1)}{SSTO} = \frac{352.27}{495.39} = .71$$

However, the coefficient of partial determination between $Y$ and $X_1$, when $X_2$ is already in the regression model, is much smaller:

$$R_{Y1|2}^2 = \frac{SSR(X_1|X_2)}{SSE(X_2)} = \frac{3.47}{113.42} = .03$$

The reason for the small coefficient of partial determination here is, as we have seen, that $X_1$ and $X_2$ are highly correlated with each other and with the response variable. Hence, $X_1$ provides only relatively limited additional information beyond that furnished by $X_2$.

2. The extra sum of squares for a predictor variable after other correlated predictor variables are in the model need not necessarily be smaller than before these other variables are in the model, as we found in the body fat example. In special cases, it can be larger. Consider the following special data set and its correlation matrix:

  $Y$    $X_1$   $X_2$
  20       5      25
  20      10      30
   0       5       5
   1      10      10

           $Y$     $X_1$    $X_2$
$Y$        1.0     .026     .976
$X_1$              1.0      .243
$X_2$                       1.0
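The sums of squares for this four-case data set can be computed directly. The following is a minimal sketch with hypothetical helper names; it fits the one-predictor and two-predictor first-order models from centered sums of squares and cross products.

```python
def centered_cross_product(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def ssr_one_predictor(x, y):
    """Regression sum of squares for the simple regression of y on x."""
    return centered_cross_product(x, y) ** 2 / centered_cross_product(x, x)


def ssr_two_predictors(x1, x2, y):
    """Regression sum of squares for the first-order model with x1 and x2."""
    s11 = centered_cross_product(x1, x1)
    s22 = centered_cross_product(x2, x2)
    s12 = centered_cross_product(x1, x2)
    s1y = centered_cross_product(x1, y)
    s2y = centered_cross_product(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1 * s1y + b2 * s2y


y = [20, 20, 0, 1]
x1 = [5, 10, 5, 10]
x2 = [25, 30, 5, 10]

ssr_x1 = ssr_one_predictor(x1, y)
ssr_x2 = ssr_one_predictor(x2, y)
ssr_both = ssr_two_predictors(x1, x2, y)
ssr_x1_given_x2 = ssr_both - ssr_x2
ssr_x2_given_x1 = ssr_both - ssr_x1
print(ssr_x1, ssr_x1_given_x2, ssr_x2, ssr_x2_given_x1)
```

For these data each extra sum of squares is larger when the other variable is already in the model than when it is not.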





Here, $Y$ and $X_2$ are highly positively correlated, but $Y$ and $X_1$ are practically uncorrelated. In addition, $X_1$ and $X_2$ are moderately positively correlated. The extra sum of squares for $X_1$ when it is the only variable in the model for this data set is $SSR(X_1) = .25$, but when $X_2$ already is in the model the extra sum of squares is $SSR(X_1|X_2) = 18.01$. Similarly, we have for these data:

$SSR(X_2) = 362.49 \qquad SSR(X_2|X_1) = 380.25$

The increase in the extra sums of squares with the addition of the other predictor variable in the model is related to the special situation here that $X_1$ is practically uncorrelated with $Y$ but moderately correlated with $X_2$, which in turn is highly correlated with $Y$. The general point even here still holds: the extra sum of squares is affected by the other correlated predictor variables already in the model.

When $SSR(X_1|X_2) > SSR(X_1)$, as in the example just cited, the variable $X_2$ is sometimes called a suppressor variable. Since $SSR(X_2|X_1) > SSR(X_2)$ in the example, the variable $X_1$ would also be called a suppressor variable. ■

Effects on $s\{b_k\}$. Note from Table 7.2 for the body fat example how much more imprecise the estimated regression coefficients $b_1$ and $b_2$ become as more predictor variables are added to the regression model:

Variables in Model      $s\{b_1\}$    $s\{b_2\}$
$X_1$                     .1288          --
$X_2$                      --           .1100
$X_1, X_2$                .3034         .2912
$X_1, X_2, X_3$          3.016         2.582

Again, the high degree of multicollinearity among the predictor variables is responsible for the inflated variability of the estimated regression coefficients.

Effects on Fitted Values and Predictions. Notice in Table 7.2 for the body fat example that the high multicollinearity among the predictor variables does not prevent the mean square error, measuring the variability of the error terms, from being steadily reduced as additional variables are added to the regression model:

Variables in Model       MSE
$X_1$                    7.95
$X_1, X_2$               6.47
$X_1, X_2, X_3$          6.15

Furthermore, the precision of fitted values within the range of the observations on the predictor variables is not eroded with the addition of correlated predictor variables into the regression model. Consider the estimation of mean body fat when the only predictor variable in the model is triceps skinfold thickness ($X_1$) for $X_{h1} = 25.0$. The fitted value and its estimated standard deviation are (calculations not shown):

$\hat{Y}_h = 19.93 \qquad s\{\hat{Y}_h\} = .632$

When the highly correlated predictor variable thigh circumference ($X_2$) is also included in the model, the estimated mean body fat and its estimated standard deviation are as follows




for $X_{h1} = 25.0$ and $X_{h2} = 50.0$:

$\hat{Y}_h = 19.36 \qquad s\{\hat{Y}_h\} = .624$

Thus, the precision of the estimated mean response is equally good as before, despite the addition of the second predictor variable that is highly correlated with the first one. This stability in the precision of the estimated mean response occurred despite the fact that the estimated standard deviation of $b_1$ became substantially larger when $X_2$ was added to the model (Table 7.2). The essential reason for the stability is that the covariance between $b_1$ and $b_2$ is negative, which plays a strong counteracting influence to the increase in $s^2\{b_1\}$ in determining the value of $s^2\{\hat{Y}_h\}$ as given in (6.79).

When all three predictor variables are included in the model, the estimated mean body fat and its estimated standard deviation are as follows for $X_{h1} = 25.0$, $X_{h2} = 50.0$, and $X_{h3} = 29.0$:

$\hat{Y}_h = 19.19 \qquad s\{\hat{Y}_h\} = .621$

Thus, the addition of the third predictor variable, which is highly correlated with the first two 
predictor variables together, also does not materially affect the precision of the estimated 
mean response. 


Effects on Simultaneous Tests of $\beta_k$. A not infrequent abuse in the analysis of multiple regression models is to examine the $t^*$ statistic in (6.51b):

$$t^* = \frac{b_k}{s\{b_k\}}$$

for each regression coefficient in turn to decide whether $\beta_k = 0$ for $k = 1, \ldots, p-1$. Even if a simultaneous inference procedure is used, and often it is not, problems still exist when the predictor variables are highly correlated.

Suppose we wish to test whether $\beta_1 = 0$ and $\beta_2 = 0$ in the body fat example regression model with two predictor variables of Table 7.2c. Controlling the family level of significance at .05, we require with the Bonferroni method that each of the two $t$ tests be conducted with level of significance .025. Hence, we need $t(.9875; 17) = 2.46$. Since both $t^*$ statistics in Table 7.2c have absolute values that do not exceed 2.46, we would conclude from the two separate tests that $\beta_1 = 0$ and that $\beta_2 = 0$. Yet the proper $F$ test for $H_0$: $\beta_1 = \beta_2 = 0$ would lead to the conclusion $H_a$, that not both coefficients equal zero. This can be seen from Table 7.2c, where we find $F^* = MSR/MSE = 192.72/6.47 = 29.8$, which far exceeds $F(.95; 2, 17) = 3.59$.
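The contrast between the two marginal $t$ tests and the overall $F$ test can be laid out numerically. This sketch takes the coefficients and standard deviations for the two-predictor model from the tables earlier in this section, and it takes the critical values 2.46 and 3.59 from the text rather than computing them; the variable names are illustrative.

```python
# Two-predictor body fat model: estimates and estimated standard deviations
b1, s_b1 = 0.2224, 0.3034
b2, s_b2 = 0.6594, 0.2912

t1_star = b1 / s_b1
t2_star = b2 / s_b2
t_critical = 2.46   # t(.9875; 17), Bonferroni with family level .05

msr, mse = 192.72, 6.47
f_star = msr / mse
f_critical = 3.59   # F(.95; 2, 17)

reject_beta1 = abs(t1_star) > t_critical
reject_beta2 = abs(t2_star) > t_critical
reject_both_zero = f_star > f_critical

print(t1_star, t2_star, f_star)
print(reject_beta1, reject_beta2, reject_both_zero)
```

Neither marginal test rejects, yet the joint test rejects decisively, which is the apparent paradox discussed next.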

The reason for this apparently paradoxical result is that each $t^*$ test is a marginal test, as we have seen in (7.15) from the perspective of the general linear test approach. Thus, a small $SSR(X_1|X_2)$ here indicates $X_1$ does not provide much additional information beyond $X_2$, which already is in the model; hence, we are led to the conclusion that $\beta_1 = 0$. Similarly, we are led to conclude $\beta_2 = 0$ here because $SSR(X_2|X_1)$ is small, indicating that $X_2$ does not provide much more additional information when $X_1$ is already in the model. But the two tests of the marginal effects of $X_1$ and $X_2$ together are not equivalent to testing whether there is a regression relation between $Y$ and the two predictor variables. The reason is that the reduced model for each of the separate tests contains the other predictor variable, whereas the reduced model for testing whether both $\beta_1 = 0$ and $\beta_2 = 0$ would contain




neither predictor variable. The proper $F$ test shows that there is a definite regression relation here between $Y$ and $X_1$ and $X_2$.

The same paradox would be encountered in Table 7.2d for the regression model with 
three predictor variables if three simultaneous tests on the regression coefficients were 
conducted at family level of significance .05. 


Comments 

1. It was noted in Section 7.5 that a near-zero determinant of $\mathbf{X}'\mathbf{X}$ is a potential source of serious roundoff errors in normal equations calculations. Severe multicollinearity has the effect of making this determinant come close to zero. Thus, under severe multicollinearity, the regression coefficients may be subject to large roundoff errors as well as large sampling variances. Hence, it is particularly advisable to employ the correlation transformation (7.44) in normal equations calculations when multicollinearity is present.

2. Just as high intercorrelations among the predictor variables tend to make the estimated regression coefficients imprecise (i.e., erratic from sample to sample), so do the coefficients of partial correlation between the response variable and each predictor variable tend to become erratic from sample to sample when the predictor variables are highly correlated.


3. The effect of intercorrelations among the predictor variables on the standard deviations of the estimated regression coefficients can be seen readily when the variables in the model are transformed by means of the correlation transformation (7.44). Consider the first-order model with two predictor variables:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{7.61}$$

This model in the variables transformed by (7.44) becomes:

$$Y_i^* = \beta_1^* X_{i1}^* + \beta_2^* X_{i2}^* + \varepsilon_i^* \tag{7.62}$$

The $(\mathbf{X}'\mathbf{X})^{-1}$ matrix for this standardized model is given by (7.50) and (7.54c):

$$(\mathbf{X}'\mathbf{X})^{-1} = \mathbf{r}_{XX}^{-1} = \frac{1}{1 - r_{12}^2} \begin{bmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{bmatrix} \tag{7.63}$$


Hence, the variance-covariance matrix of the estimated regression coefficients is by (6.46) and (7.63):

$$\boldsymbol{\sigma}^2\{\mathbf{b}\} = (\sigma^*)^2\, \mathbf{r}_{XX}^{-1} = (\sigma^*)^2\, \frac{1}{1 - r_{12}^2} \begin{bmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{bmatrix} \tag{7.64}$$

where $(\sigma^*)^2$ is the error term variance for the standardized model (7.62). We see that the estimated regression coefficients $b_1^*$ and $b_2^*$ have the same variance here:

$$\sigma^2\{b_1^*\} = \sigma^2\{b_2^*\} = \frac{(\sigma^*)^2}{1 - r_{12}^2} \tag{7.65}$$

and that each of these variances becomes larger as the correlation between $X_1$ and $X_2$ increases. Indeed, as $X_1$ and $X_2$ approach perfect correlation (i.e., as $r_{12}^2$ approaches 1), the variances of $b_1^*$ and $b_2^*$ become larger without limit.
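To get a feel for how quickly the variance in (7.65) grows, the sketch below tabulates the multiplier $1/(1 - r_{12}^2)$ applied to $(\sigma^*)^2$. The function name is hypothetical, and apart from $r_{12} = .924$, which is the body fat correlation quoted earlier, the correlation values are arbitrary illustrations.

```python
def variance_multiplier(r_12):
    """Multiplier on (sigma*)^2 in (7.65): 1 / (1 - r_12^2)."""
    if abs(r_12) >= 1.0:
        raise ValueError("variance is unbounded when |r_12| = 1")
    return 1.0 / (1.0 - r_12 ** 2)


correlations = [0.0, 0.5, 0.9, 0.924, 0.99, 0.999]
multipliers = [variance_multiplier(r) for r in correlations]

for r, m in zip(correlations, multipliers):
    print(f"r_12 = {r:<6} multiplier = {m:.2f}")
```

With uncorrelated predictors the multiplier is 1; it grows slowly at moderate correlations and very rapidly as $r_{12}$ nears 1.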


4. We noted in our discussion of simultaneous tests of the regression coefficients that it is possible that a set of predictor variables is related to the response variable, yet all of the individual tests on the regression coefficients will lead to the conclusion that they equal zero because of the multicollinearity among the predictor variables. This apparently paradoxical result is also possible under special circumstances when there is no multicollinearity among the predictor variables. The special circumstances are not likely to be found in practice, however. ■





Need for More Powerful Diagnostics for Multicollinearity 

As we have seen, multicollinearity among the predictor variables can have important consequences for interpreting and using a fitted regression model. The diagnostic tool considered here for identifying multicollinearity, namely the pairwise coefficients of simple correlation between the predictor variables, is frequently helpful. Often, however, serious multicollinearity exists without being disclosed by the pairwise correlation coefficients. In Chapter 10, we present a more powerful tool for identifying the existence of serious multicollinearity. Some remedial measures for lessening the effects of multicollinearity will be considered in Chapter 11.


Cited Reference

7.1. Kennedy, W. J., Jr., and J. E. Gentle. Statistical Computing. New York: Marcel Dekker, 1980.

Problems

7.1. State the number of degrees of freedom that are associated with each of the following extra
sums of squares: (1) $SSR(X_1 \mid X_2)$; (2) $SSR(X_2 \mid X_1, X_3)$; (3) $SSR(X_1, X_2 \mid X_3, X_4)$;
(4) $SSR(X_1, X_2, X_3 \mid X_4, X_5)$.

7.2. Explain in what sense the regression sum of squares $SSR(X_1)$ is an extra sum of squares.

7.3. Refer to Brand preference Problem 6.5.

a. Obtain the analysis of variance table that decomposes the regression sum of squares into
extra sums of squares associated with $X_1$ and with $X_2$, given $X_1$.

b. Test whether $X_2$ can be dropped from the regression model given that $X_1$ is retained. Use
the $F^*$ test statistic and level of significance .01. State the alternatives, decision rule, and
conclusion. What is the P-value of the test?

*7.4. Refer to Grocery retailer Problem 6.9.

a. Obtain the analysis of variance table that decomposes the regression sum of squares into
extra sums of squares associated with $X_1$; with $X_3$, given $X_1$; and with $X_2$, given $X_1$ and $X_3$.

b. Test whether $X_2$ can be dropped from the regression model given that $X_1$ and $X_3$ are retained.
Use the $F^*$ test statistic and $\alpha = .05$. State the alternatives, decision rule, and conclusion.
What is the P-value of the test?

c. Does $SSR(X_1) + SSR(X_2 \mid X_1)$ equal $SSR(X_2) + SSR(X_1 \mid X_2)$ here? Must this always be the
case?

*7.5. Refer to Patient satisfaction Problem 6.15.

a. Obtain the analysis of variance table that decomposes the regression sum of squares into
extra sums of squares associated with $X_2$; with $X_1$, given $X_2$; and with $X_3$, given $X_2$ and $X_1$.

b. Test whether $X_3$ can be dropped from the regression model given that $X_1$ and $X_2$ are retained.
Use the $F^*$ test statistic and level of significance .025. State the alternatives, decision rule,
and conclusion. What is the P-value of the test?

*7.6. Refer to Patient satisfaction Problem 6.15. Test whether both $X_2$ and $X_3$ can be dropped from
the regression model given that $X_1$ is retained. Use $\alpha = .025$. State the alternatives, decision
rule, and conclusion. What is the P-value of the test?

7.7. Refer to Commercial properties Problem 6.18.

a. Obtain the analysis of variance table that decomposes the regression sum of squares into
extra sums of squares associated with $X_4$; with $X_1$, given $X_4$; with $X_2$, given $X_1$ and $X_4$;
and with $X_3$, given $X_1$, $X_2$ and $X_4$.

b. Test whether $X_3$ can be dropped from the regression model given that $X_1$, $X_2$ and $X_4$ are
retained. Use the $F^*$ test statistic and level of significance .01. State the alternatives, decision
rule, and conclusion. What is the P-value of the test?

7.8. Refer to Commercial properties Problems 6.18 and 7.7. Test whether both $X_2$ and $X_3$ can be
dropped from the regression model given that $X_1$ and $X_4$ are retained; use $\alpha = .01$. State the
alternatives, decision rule, and conclusion. What is the P-value of the test?

*7.9. Refer to Patient satisfaction Problem 6.15. Test whether $\beta_1 = -1.0$ and $\beta_2 = 0$; use $\alpha = .025$.
State the alternatives, full and reduced models, decision rule, and conclusion.

7.10. Refer to Commercial properties Problem 6.18. Test whether $\beta_1 = -.1$ and $\beta_2 = .4$; use
$\alpha = .01$. State the alternatives, full and reduced models, decision rule, and conclusion.

7.11. Refer to the work crew productivity example in Table 7.6.

a. Calculate $R^2_{Y1}$, $R^2_{Y2}$, $R^2_{12}$, $R^2_{Y1|2}$, $R^2_{Y2|1}$, and $R^2$. Explain what each coefficient measures and
interpret your results.

b. Are any of the results obtained in part (a) special because the two predictor variables are
uncorrelated?

7.12. Refer to Brand preference Problem 6.5. Calculate $R^2_{Y1}$, $R^2_{Y2}$, $R^2_{12}$, $R^2_{Y1|2}$, $R^2_{Y2|1}$, and $R^2$. Explain
what each coefficient measures and interpret your results.

*7.13. Refer to Grocery retailer Problem 6.9. Calculate r\ 2 , R\ 2 , and R 2 . 

Explain what each coefficient measures and interpret your results. 

*7.14. Refer to Patient satisfaction Problem 6.15. 

a. Calculate $R^2_{Y1}$, $R^2_{Y1|2}$, and $R^2_{Y1|23}$. How is the degree of marginal linear association between
$Y$ and $X_1$ affected, when adjusted for $X_2$? When adjusted for both $X_2$ and $X_3$?

b. Make a similar analysis to that in part (a) for the degree of marginal linear association
between $Y$ and $X_2$. Are your findings similar to those in part (a) for $Y$ and $X_1$?

7.15. Refer to Commercial properties Problems 6.18 and 7.7. Calculate 咕，坼 , H ， 呤， 
朽 2 |i 4 ， 朽 3 |P 4 ， ar *d R 2 - Explain what each coefficient measures and interpret youi' results. 
How is the degree of marginal linear association between $Y$ and $X_1$ affected, when adjusted
for $X_4$?

7.16. Refer to Brand preference Problem 6.5. 

a. Transform the variables by means of the correlation transformation (7.44) and fit the
standardized regression model (7.45).

b. Interpret the standardized regression coefficient b\. 

c. Transform the estimated standardized regression coefficients by means of (7.53) back to the
ones for the fitted regression model in the original variables. Verify that they are the same
as the ones obtained in Problem 6.5b.

*7.17. Refer to Grocery retailer Problem 6.9.

a. Transform the variables by means of the correlation transformation (7.44) and fit the 
standardized regression model (7.45). 

b. Calculate the coefficients of determination between all pairs of predictor variables. Is it 
meaningful here to consider the standardized regression coefficients to reflect the effect of 
one predictor variable when the others are held constant? 

c. Transform the estimated standardized regression coefficients by means of (7.53) back to the 
ones for the fitted regression model in the original variables. Verify that they are the same 
as the ones obtained in Problem 6.10a.

*7.18. Refer to Patient satisfaction Problem 6.15. 




a. Transform the variables by means of the correlation transformation (7.44) and fit the 
standardized regression model (7.45). 

b. Calculate the coefficients of determination between all pairs of predictor variables. Do these 
indicate that it is meaningful here to consider the standardized regression coefficients as 
indicating the effect of one predictor variable when the others are held constant? 

c. Transform the estimated standardized regression coefficients by means of (7.53) back to the 
ones for the fitted regression model in the original variables. Verify that they are the same 
as the ones obtained in Problem 6.15c. 

7.19. Refer to Commercial properties Problem 6.18. 

a. Transform the variables by means of the correlation transformation (7.44) and fit the
standardized regression model (7.45).

b. Interpret the standardized regression coefficient b^. 

c. Transform the estimated standardized regression coefficients by means of (7.53) back to the
ones for the fitted regression model in the original variables. Verify that they are the same
as the ones obtained in Problem 6.18c.

7.20. A speaker stated in a workshop on applied regression analysis: “In business and the social
sciences, some degree of multicollinearity in survey data is practically inevitable.” Does this
statement apply equally to experimental data?

7.21. Refer to the example of perfectly correlated predictor variables in Table 7.8. 

a. Develop another response function, like response functions (7.58) and (7.59), that fits the 
data perfectly. 

b. What is the intersection of the infinitely many response surfaces that fit the data perfectly? 

7.22. The progress report of a research analyst to the supervisor stated: “All the estimated regression
coefficients in our model with three predictor variables to predict sales are statistically
significant. Our new preliminary model with seven predictor variables, which includes the three
variables of our smaller model, is less satisfactory because only two of the seven regression
coefficients are statistically significant. Yet in some initial trials the expanded model is giving
more precise sales predictions than the smaller model. The reasons for this anomaly are now
being investigated.” Comment.

7.23. Two authors wrote as follows: “Our research utilized a multiple regression model. Two of
the predictor variables important in our theory turned out to be highly correlated in our data
set. This made it difficult to assess the individual effects of each of these variables separately.
We retained both variables in our model, however, because the high coefficient of multiple
determination makes this difficulty unimportant.” Comment.

7.24. Refer to Brand preference Problem 6.5. 

a. Fit first-order simple linear regression model (2.1) for relating brand liking ($Y$) to moisture
content ($X_1$). State the fitted regression function.

b. Compare the estimated regression coefficient for moisture content obtained in part (a) with
the corresponding coefficient obtained in Problem 6.5b. What do you find?

c. Does $SSR(X_1)$ equal $SSR(X_1 \mid X_2)$ here? If not, is the difference substantial?

d. Refer to the correlation matrix obtained in Problem 6.5a. What bearing does this have on
your findings in parts (b) and (c)?

*7.25. Refer to Grocery retailer Problem 6.9.

a. Fit first-order simple linear regression model (2.1) for relating total hours required to handle
shipment ($Y$) to total number of cases shipped ($X_1$). State the fitted regression function.





Exercises 


b. Compare the estimated regression coefficient for total cases shipped obtained in part (a)
with the corresponding coefficient obtained in Problem 6.10a. What do you find?

c. Does $SSR(X_1)$ equal $SSR(X_1 \mid X_2)$ here? If not, is the difference substantial?

d. Refer to the correlation matrix obtained in Problem 6.9c. What bearing does this have on 
your findings in parts (b) and (c)? 

*7.26. Refer to Patient satisfaction Problem 6.15. 

a. Fit first-order linear regression model (6.1) for relating patient satisfaction ($Y$) to patient's
age ($X_1$) and severity of illness ($X_2$). State the fitted regression function.

b. Compare the estimated regression coefficients for patient's age and severity of illness
obtained in part (a) with the corresponding coefficients obtained in Problem 6.15c. What do
you find?

c. Does $SSR(X_1)$ equal $SSR(X_1 \mid X_3)$ here? Does $SSR(X_2)$ equal $SSR(X_2 \mid X_3)$?

d. Refer to the correlation matrix obtained in Problem 6.15b. What bearing does it have on
your findings in parts (b) and (c)? 

7.27. Refer to Commercial properties Problem 6.18.

a. Fit first-order linear regression model (6.1) for relating rental rates ($Y$) to property age ($X_1$)
and size ($X_4$). State the fitted regression function.

b. Compare the estimated regression coefficients for property age and size with the
corresponding coefficients obtained in Problem 6.18c. What do you find?

c. Does SSR(Xi) equal SSR(Xi\X 3 ) hem? Does SSR(Xi) equal SSR(X t \X^ 

d. Refer to the correlation matrix obtained in Problem 6.18b. What bearing does this have on 
your findings in parts (b) and (c)?


7.28. a. Define each of the following extra sums of squares: (1) $SSR(X_5 \mid X_1)$; (2) $SSR(X_3, X_4 \mid X_1)$;
(3) $SSR(X_4 \mid X_1, X_2, X_3)$.

b. For a multiple regression model with five $X$ variables, what is the relevant extra sum of
squares for testing whether or not $\beta_5 = 0$? whether or not $\beta_2 = \beta_4 = 0$?

7.29. Show that: 

a. $SSR(X_1, X_2, X_3, X_4) = SSR(X_1) + SSR(X_2, X_3 \mid X_1) + SSR(X_4 \mid X_1, X_2, X_3)$.

b. $SSR(X_1, X_2, X_3, X_4) = SSR(X_2, X_3) + SSR(X_1 \mid X_2, X_3) + SSR(X_4 \mid X_1, X_2, X_3)$.

7.30. Refer to Brand preference Problem 6.5. 

a. Regress $Y$ on $X_2$ using simple linear regression model (2.1) and obtain the residuals.

b. Regress $X_1$ on $X_2$ using simple linear regression model (2.1) and obtain the residuals.

c. Calculate the coefficient of simple correlation between the two sets of residuals and show
that it equals $r_{Y1|2}$.

7.31. The following regression model is being considered in a water resources study: 

Yi — A) + -^['1 + + + Ai \ jh + £/ 

State the reduced models for testing whether or not: (!) ^ = /} 4 = 0 , ( 2 ) ^3 = 0 , (3) j 6 i = 

灼 = 5, ⑷仏 = 7. ’ ' 

7.32. The following regression model is being considered in a market research study: 


^ — ^0 + + + + £ v 





State the reduced models for testing whether or not: ( 1 ) A == A = 0, （ 2) = 0, （ 3) 卢 3 = 5, 

( 4 ) % = 10 , ⑶户 ,= 仿 . 

7.33. Show the equivalence of the expressions in (7.36) and (7.41) for $R^2_{Y2|1}$.

7.34. Refer to the work crew productivity example in Table 7.6.

a. For the variables transformed according to (7.44), obtain: (1) $\mathbf{X}'\mathbf{X}$, (2) $\mathbf{X}'\mathbf{Y}$, (3) $\mathbf{b}$, (4) $s^2\{\mathbf{b}\}$.

b. Show that the standardized regression coefficients obtained in part (a3) are related to the
regression coefficients for the regression model in the original variables according to (7.53).

7.35. Derive the relations between the $\beta_k$ and $\beta_k^*$ in (7.46a) for $p - 1 = 2$.

7.36. Derive the expression for $\mathbf{X}'\mathbf{Y}$ in (7.51) for standardized regression model (7.30) for $p - 1 = 2$.


Projects 


7.37. Refer to the CDI data set in Appendix C.2. For predicting the number of active physicians ($Y$)
in a county, it has been decided to include total population ($X_1$) and total personal income ($X_2$)
as predictor variables. The question now is whether an additional predictor variable would be
helpful in the model and, if so, which variable would be most helpful. Assume that a first-order
multiple regression model is appropriate.

a. For each of the following variables, calculate the coefficient of partial determination given
that $X_1$ and $X_2$ are included in the model: land area ($X_3$), percent of population 65 or older
($X_4$), number of hospital beds ($X_5$), and total serious crimes ($X_6$).

b. On the basis of the results in part (a), which of the four additional predictor variables is best?
Is the extra sum of squares associated with this variable larger than those for the other three
variables?

c. Using the $F^*$ test statistic, test whether or not the variable determined to be best in part (b)
is helpful in the regression model when $X_1$ and $X_2$ are included in the model; use $\alpha = .01$.
State the alternatives, decision rule, and conclusion. Would the $F^*$ test statistics for the other
three potential predictor variables be as large as the one here? Discuss.

7.38. Refer to the SENIC data set in Appendix C.1. For predicting the average length of stay of
patients in a hospital ($Y$), it has been decided to include age ($X_1$) and infection risk ($X_2$) as
predictor variables. The question now is whether an additional predictor variable would be
helpful in the model and, if so, which variable would be most helpful. Assume that a first-order
multiple regression model is appropriate.

a. For each of the following variables, calculate the coefficient of partial determination given
that $X_1$ and $X_2$ are included in the model: routine culturing ratio ($X_3$), average daily census
($X_4$), number of nurses ($X_5$), and available facilities and services ($X_6$).

b. On the basis of the results in part (a), which of the four additional predictor variables is best?
Is the extra sum of squares associated with this variable larger than those for the other three
variables?

c. Using the $F^*$ test statistic, test whether or not the variable determined to be best in part (b)
is helpful in the regression model when $X_1$ and $X_2$ are included in the model; use $\alpha = .05$.
State the alternatives, decision rule, and conclusion. Would the $F^*$ test statistics for the other
three potential predictor variables be as large as the one here? Discuss.



Chapter 8

Regression Models for Quantitative and Qualitative Predictors


In this chapter, we consider in greater detail standard modeling techniques for quantitative
predictors, for qualitative predictors, and for regression models containing both quantitative
and qualitative predictors. These techniques include the use of interaction and polynomial
terms for quantitative predictors, and the use of indicator variables for qualitative predictors.

8.1 Polynomial Regression Models

We first consider polynomial regression models for quantitative predictor variables. They 
are among the most frequently used curvilinear response models in practice because they 
are handled easily as a special case of the general linear regression model (6.7). Next, we 
discuss several commonly used polynomial regression models. Then we present a case to 
illustrate some of the major issues encountered with polynomial regression models. 

Uses of Polynomial Models 

Polynomial regression models have two basic types of uses: 

1. When the true curvilinear response function is indeed a polynomial function. 

2. When the true curvilinear response function is unknown (or complex) but a polynomial 
function is a good approximation to the true function. 

The second type of use, where the polynomial function is employed as an approximation
when the shape of the true curvilinear response function is unknown, is very common. It
may be viewed as a nonparametric approach to obtaining information about the shape of
the response function.

A main danger in using polynomial regression models, as we shall see, is that
extrapolations may be hazardous with these models, especially those with higher-order terms.
Polynomial regression models may provide good fits for the data at hand, but may turn in
unexpected directions when extrapolated beyond the range of the data.


One Predictor Variable — Second Order

Polynomial regression models may contain one, two, or more than two predictor variables. 
Further, each predictor variable may be present in various powers. We begin by considering 
a polynomial regression model with one predictor variable raised to the first and second 
powers: 

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i \tag{8.1}$$

where:

$$x_i = X_i - \bar{X}$$

This polynomial model is called a second-order model with one predictor variable because
the single predictor variable is expressed in the model to the first and second powers. Note
that the predictor variable is centered (in other words, expressed as a deviation around its
mean $\bar{X}$) and that the $i$th centered observation is denoted by $x_i$. The reason for using a
centered predictor variable in the polynomial regression model is that $X$ and $X^2$ often will be
highly correlated. This, as we noted in Section 7.5, can cause serious computational
difficulties when the $\mathbf{X}'\mathbf{X}$ matrix is inverted for estimating the regression coefficients in the normal
equations calculations. Centering the predictor variable often reduces the multicollinearity
substantially, as we shall illustrate in an example, and tends to avoid computational
difficulties.

The regression coefficients in polynomial regression are frequently written in a slightly 
different fashion, to reflect the pattern of the exponents: 

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i \tag{8.2}$$

We shall employ this latter notation in this section.

The response function for regression model (8.2) is:

$$E\{Y\} = \beta_0 + \beta_1 x + \beta_{11} x^2 \tag{8.3}$$

This response function is a parabola and is frequently called a quadratic response function. 
Figure 8.1 contains two examples of second-order polynomial response functions. 

FIGURE 8.1 Examples of Second-Order Polynomial Response Functions.

[Figure: panels (a) and (b), each plotting $Y$ (0 to 60) against $x$ (-2 to 2).]

The regression coefficient $\beta_0$ represents the mean response of $Y$ when $x = 0$, i.e., when
$X = \bar{X}$. The regression coefficient $\beta_1$ is often called the linear effect coefficient, and $\beta_{11}$ is
called the quadratic effect coefficient.


Comments 

1. The danger of extrapolating a polynomial response function is illustrated by the response function
in Figure 8.1a. If this function is extrapolated beyond $x = 2$, it actually turns downward, which
might not be appropriate in a given case.

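2. The algebraic version of the least squares normal equations:

$$\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{X}'\mathbf{Y}$$

for the second-order polynomial regression model (8.2) can be readily obtained from (6.77) by
replacing $X_{i1}$ by $x_i$ and $X_{i2}$ by $x_i^2$. Since $\sum x_i = 0$, this yields the normal equations:

$$\begin{aligned}
\sum Y_i &= n b_0 + b_{11} \sum x_i^2 \\
\sum x_i Y_i &= b_1 \sum x_i^2 + b_{11} \sum x_i^3 \\
\sum x_i^2 Y_i &= b_0 \sum x_i^2 + b_1 \sum x_i^3 + b_{11} \sum x_i^4
\end{aligned} \tag{8.4}$$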

One Predictor Variable — Third Order 

The regression model: 

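$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i \tag{8.5}$$

where:

$$x_i = X_i - \bar{X}$$

is a third-order model with one predictor variable. The response function for regression
model (8.5) is:

$$E\{Y\} = \beta_0 + \beta_1 x + \beta_{11} x^2 + \beta_{111} x^3 \tag{8.6}$$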

Figure 8.2 contains two examples of third-order polynomial response functions. 

One Predictor Variable — Higher Orders 

Polynomial models with the predictor variable present in higher powers than the third 
should be employed with special caution. The interpretation of the coefficients becomes 
difficult for such models, and the models may be highly erratic for interpolations and even 
small extrapolations. It must be recognized in this connection that a polynomial model of 
sufficiently high order can always be found to fit data containing no repeat observations 
perfectly. For instance, the fitted polynomial regression function for one predictor variable 
of order n — 1 will pass through all n observed Y values. One needs to be wary, therefore, of 
using high-order polynomials for the sole purpose of obtaining a good fit. Such regression 
functions may not show clearly the basic elements of the regression relation between X and 
Y and may lead to erratic interpolations and extrapolations. 
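The claim that a polynomial of order $n - 1$ passes through all $n$ observed $Y$ values can be checked numerically. The sketch below uses NumPy's `polyfit` and `polyval`; the five data points are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: n = 5 cases with distinct X values (no repeat observations).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 3.2, 6.8, 5.5])
n = len(X)

# A polynomial of order n - 1 = 4 has n coefficients, so the least squares
# fit can reproduce every observed Y value.
coef_full = np.polyfit(X, Y, deg=n - 1)
resid_full = Y - np.polyval(coef_full, X)

# A second-order fit leaves nonzero residuals.
coef_quad = np.polyfit(X, Y, deg=2)
resid_quad = Y - np.polyval(coef_quad, X)

max_abs_resid_full = float(np.max(np.abs(resid_full)))
max_abs_resid_quad = float(np.max(np.abs(resid_quad)))
print("largest |residual|, order 4:", max_abs_resid_full)
print("largest |residual|, order 2:", max_abs_resid_quad)

# Compare the two fitted functions just beyond the data range (X = 6).
print("order 4 at X = 6:", np.polyval(coef_full, 6.0))
print("order 2 at X = 6:", np.polyval(coef_quad, 6.0))
```

The order-4 residuals are zero up to rounding error, yet that perfect fit says nothing about how the function behaves between or beyond the observed $X$ values, which is the caution raised above.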



Two Predictor Variables — Second Order

The regression model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{11} x_{i1}^2 + \beta_{22} x_{i2}^2 + \beta_{12} x_{i1} x_{i2} + \varepsilon_i \tag{8.7}$$

where:

$$x_{i1} = X_{i1} - \bar{X}_1$$
$$x_{i2} = X_{i2} - \bar{X}_2$$


is a second-order model with two predictor variables. The response function is: 

$$E\{Y\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 \tag{8.8}$$

which is the equation of a conic section. Note that regression model (8.7) contains separate
linear and quadratic components for each of the two predictor variables and a cross-product
term. The latter represents the interaction effect between $x_1$ and $x_2$, as we noted in Chapter 6.
The coefficient $\beta_{12}$ is often called the interaction effect coefficient.

Figure 8.3 contains a representation of the response surface and the contour curves for 
a second-order response function with two predictor variables: 

$$E\{Y\} = 1{,}740 - 4x_1^2 - 3x_2^2 - 3x_1 x_2$$


The contour curves correspond to different response levels and show the various
combinations of levels of the two predictor variables that yield the same level of response. Note
that the response surface in Figure 8.3a has a maximum at $x_1 = 0$ and $x_2 = 0$. Figure 6.2b
presents another type of second-order polynomial response function with two predictor
variables, this one containing a saddle point.
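The statement that this surface has a maximum at the origin can be verified from its second derivatives. The following sketch is an illustration rather than part of the text: it writes out the matrix of second partial derivatives of $E\{Y\} = 1{,}740 - 4x_1^2 - 3x_2^2 - 3x_1x_2$ by hand and checks the signs of its eigenvalues with NumPy.

```python
import numpy as np


def response(x1, x2):
    """E{Y} = 1,740 - 4*x1^2 - 3*x2^2 - 3*x1*x2, the surface of Figure 8.3."""
    return 1740.0 - 4.0 * x1**2 - 3.0 * x2**2 - 3.0 * x1 * x2


# Second partial derivatives (constant for a quadratic surface):
# d2/dx1^2 = -8, d2/dx2^2 = -6, d2/dx1dx2 = -3.
hessian = np.array([[-8.0, -3.0],
                    [-3.0, -6.0]])
eigenvalues = np.linalg.eigvalsh(hessian)

# The gradient (-8*x1 - 3*x2, -3*x1 - 6*x2) vanishes only at (0, 0).
# Both eigenvalues negative means the stationary point is a maximum;
# eigenvalues of mixed sign would indicate a saddle point instead.
is_maximum = bool(np.all(eigenvalues < 0))
print("eigenvalues:", eigenvalues)
print("maximum at origin:", is_maximum, "with E{Y} =", response(0.0, 0.0))
```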


Comment 

The cross-product term $\beta_{12} x_1 x_2$ in (8.8) is considered to be a second-order term, the same as $\beta_{11} x_1^2$
or $\beta_{22} x_2^2$. The reason can be seen by writing the latter terms as $\beta_{11} x_1 x_1$ and $\beta_{22} x_2 x_2$, respectively.



FIGURE 8.3 Example of a Quadratic Response Surface — $E\{Y\} = 1{,}740 - 4x_1^2 - 3x_2^2 - 3x_1 x_2$.

[Figure: (a) Response Surface; (b) Contour Curves, with $x_1$ and $x_2$ each running from -10 to 10.]


Three Predictor Variables — Second Order 

The second-order regression model with three predictor variables is: 

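$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_{11} x_{i1}^2 + \beta_{22} x_{i2}^2 + \beta_{33} x_{i3}^2 + \beta_{12} x_{i1} x_{i2} + \beta_{13} x_{i1} x_{i3} + \beta_{23} x_{i2} x_{i3} + \varepsilon_i \tag{8.9}$$

where:

$$x_{i1} = X_{i1} - \bar{X}_1$$
$$x_{i2} = X_{i2} - \bar{X}_2$$
$$x_{i3} = X_{i3} - \bar{X}_3$$

The response function for this regression model is:

$$E\{Y\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{33} x_3^2 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3 \tag{8.10}$$

The coefficients $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ are interaction effect coefficients for interactions between
pairs of predictor variables.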

Implementation of Polynomial Regression Models 

Fitting of Polynomial Models. Fitting of polynomial regression models presents no new 
problems since, as we have seen in Chapter 6, they are special cases of the general linear 
regression model (6.7). Hence, all earlier results on fitting apply, as do the earlier results on 
making inferences. 





Hierarchical Approach to Fitting. When using a polynomial regression model as an
approximation to the true regression function, statisticians will often fit a second-order or
third-order model and then explore whether a lower-order model is adequate. For instance,
with one predictor variable, the model:

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$$

may be fitted with the hope that the cubic term and perhaps even the quadratic term can be
dropped. Thus, one would wish to test whether or not $\beta_{111} = 0$, or whether or not both $\beta_{11} = 0$
and $\beta_{111} = 0$. The decomposition of $SSR$ into extra sums of squares therefore proceeds as
follows:

$SSR(x)$

$SSR(x^2 \mid x)$

$SSR(x^3 \mid x, x^2)$

To test whether $\beta_{111} = 0$, the appropriate extra sum of squares is $SSR(x^3 \mid x, x^2)$. If,
instead, one wishes to test whether a linear term is adequate, i.e., whether $\beta_{11} = \beta_{111} = 0$, the
appropriate extra sum of squares is $SSR(x^2, x^3 \mid x) = SSR(x^2 \mid x) + SSR(x^3 \mid x, x^2)$.
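The sequential decomposition can be sketched numerically by fitting the nested models in order and differencing their error sums of squares. The data below are hypothetical, and the helper `sse` is an illustrative name; the fits use NumPy's `linalg.lstsq`.

```python
import numpy as np


def sse(columns, y):
    """Error sum of squares for a least squares fit that includes an intercept."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ b
    return float(resid @ resid)


# Hypothetical data, for illustration only.
X_raw = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 4.9, 8.2, 9.8, 13.5, 15.1, 19.9, 24.2])
x = X_raw - X_raw.mean()            # centered predictor

ssto = float(np.sum((y - y.mean()) ** 2))
sse_1 = sse([x], y)                 # model with x
sse_2 = sse([x, x**2], y)           # model with x, x^2
sse_3 = sse([x, x**2, x**3], y)     # model with x, x^2, x^3

ssr_x = ssto - sse_1                # SSR(x)
ssr_x2_given_x = sse_1 - sse_2      # SSR(x^2 | x)
ssr_x3_given_x_x2 = sse_2 - sse_3   # SSR(x^3 | x, x^2)
ssr_x2_x3_given_x = sse_1 - sse_3   # SSR(x^2, x^3 | x)

print(ssr_x, ssr_x2_given_x, ssr_x3_given_x_x2, ssr_x2_x3_given_x)
```

Each extra sum of squares is the reduction in $SSE$ from adding the next term, so the joint quantity $SSR(x^2, x^3 \mid x)$ is the sum of the two single-term reductions, as stated above.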

With the hierarchical approach, if a polynomial term of a given order is retained, then 
all related terms of lower order are also retained in the model. Thus, one would not drop 
the quadratic term of a predictor variable but retain the cubic term in the model. Since the 
quadratic term is of lower order, it is viewed as providing more basic information about the 
shape of the response function; the cubic term is of higher order and is viewed as providing 
refinements in the specification of the shape of the response function. The hierarchical 
approach to testing operates similarly for polynomial regression models with two or more 
predictor variables. Here, for instance, an interaction term (second power) would not be 
retained without also retaining the terms for the predictor variables to the first power. 


Regression Function in Terms of X. After a polynomial regression model has been 
developed, we often wish to express the final model in terms of the original variables rather 
than keeping it in terms of the centered variables. This can be done readily. For example, the 
fitted second-order model for one predictor variable that is expressed in terms of centered 
values $x = X - \bar{X}$:

$$\hat{Y} = b_0 + b_1 x + b_{11} x^2 \tag{8.11}$$

becomes in terms of the original $X$ variable:

$$\hat{Y} = b_0' + b_1' X + b_{11}' X^2 \tag{8.12}$$

where:

$$b_0' = b_0 - b_1 \bar{X} + b_{11} \bar{X}^2 \tag{8.12a}$$

$$b_1' = b_1 - 2 b_{11} \bar{X} \tag{8.12b}$$

$$b_{11}' = b_{11} \tag{8.12c}$$

The fitted values and residuals for the regression function in terms of X are exactly the 
same as for the regression function in terms of the centered values x. The reason, as we 



noted earlier, for utilizing a model that is expressed in terms of centered observations is to
reduce potential calculational difficulties due to multicollinearity among $X$, $X^2$, $X^3$, etc.,
inherent in polynomial regression.
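The back-transformation (8.12a) through (8.12c) can be sketched as a small function and checked by confirming that both forms give the same fitted values. The data and the function name `to_original_scale` are hypothetical, introduced only for this illustration.

```python
import numpy as np


def to_original_scale(b0, b1, b11, x_bar):
    """Apply (8.12a)-(8.12c): coefficients in X from those in x = X - x_bar."""
    b0_prime = b0 - b1 * x_bar + b11 * x_bar**2
    b1_prime = b1 - 2.0 * b11 * x_bar
    b11_prime = b11
    return b0_prime, b1_prime, b11_prime


# Hypothetical data, for illustration only.
X = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 25.0])
Y = np.array([41.0, 46.5, 52.0, 55.5, 56.0, 55.0, 52.5])
x_bar = X.mean()
x = X - x_bar

# np.polyfit returns the highest power first, so unpack in reverse order.
b11, b1, b0 = np.polyfit(x, Y, deg=2)
b0_p, b1_p, b11_p = to_original_scale(b0, b1, b11, x_bar)

fitted_centered = b0 + b1 * x + b11 * x**2
fitted_original = b0_p + b1_p * X + b11_p * X**2
print("fitted values agree:", np.allclose(fitted_centered, fitted_original))
```

The agreement follows from expanding $b_0 + b_1(X - \bar{X}) + b_{11}(X - \bar{X})^2$ and collecting powers of $X$.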

Comment 

The estimated standard deviations of the regression coefficients in terms of the centered variables $x$
in (8.11) do not apply to the regression coefficients in terms of the original variables $X$ in (8.12). If
the estimated standard deviations for the regression coefficients in terms of $X$ are desired, they may
be obtained by using (5.46), where the transformation matrix $\mathbf{A}$ is developed from (8.12a-c).

Case Example 

Setting. A researcher studied the effects of the charge rate and temperature on the life
of a new type of power cell in a preliminary small-scale experiment. The charge rate ($X_1$)
was controlled at three levels (.6, 1.0, and 1.4 amperes) and the ambient temperature ($X_2$)
was controlled at three levels (10, 20, 30°C). Factors pertaining to the discharge of the
power cell were held at fixed levels. The life of the power cell ($Y$) was measured in terms
of the number of discharge-charge cycles that a power cell underwent before it failed. The
data obtained in the study are contained in Table 8.1, columns 1-3.

The researcher was not sure about the nature of the response function in the range of the
factors studied. Hence, the researcher decided to fit the second-order polynomial regression
model (8.7):


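$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{11} x_{i1}^2 + \beta_{22} x_{i2}^2 + \beta_{12} x_{i1} x_{i2} + \varepsilon_i \tag{8.13}$$

for which the response function is:

$$E\{Y\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 \tag{8.14}$$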


TABLE 8.1 Data — Power Cells Example. 



Setting adapted from: S. M. Sidik, H. F. Leibecki, and J. M. Bozek, Cycles Till Failure of Silver-Zinc Cells with Competing Failure Modes - Preliminary Data
Analysis, NASA Technical Memorandum 81556, 1980.





Because of the balanced nature of the $X_1$ and $X_2$ levels studied, the researcher not only
centered the variables $X_1$ and $X_2$ around their respective means but also scaled them in
convenient units, as follows:

$$x_{i1} = \frac{X_{i1} - \bar{X}_1}{.4} = \frac{X_{i1} - 1.0}{.4}$$

$$x_{i2} = \frac{X_{i2} - \bar{X}_2}{10} = \frac{X_{i2} - 20}{10} \tag{8.15}$$

Here, the denominator used for each predictor variable is the absolute difference between
adjacent levels of the variable. These centered and scaled variables are shown in columns 4
and 5 of Table 8.1. Note that the codings defined in (8.15) lead to simple coded values, -1,
0, and 1. The squared and cross-product terms are shown in columns 6-8 of Table 8.1.

Use of the coded variables $x_1$ and $x_2$ rather than the original variables $X_1$ and $X_2$ reduces
the correlations between the first power and second power terms markedly here:

Correlation between $X_1$ and $X_1^2$: .991
Correlation between $x_1$ and $x_1^2$: 0.0

Correlation between $X_2$ and $X_2^2$: .986
Correlation between $x_2$ and $x_2^2$: 0.0

The correlations for the coded variables are zero here because of the balance of the design
of the experimental levels of the two explanatory variables. Similarly, the correlations
between the cross-product term $x_1 x_2$ and each of the terms $x_1$, $x_1^2$, $x_2$, $x_2^2$ are reduced to zero
here from levels between .60 and .76 for the corresponding terms in the original variables.
Low levels of multicollinearity can be helpful in avoiding computational inaccuracies.
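The correlations quoted above can be reproduced from the design alone. The sketch below assumes the layout implied by the text: the nine combinations of the three charge rates and three temperatures, plus two additional runs at the center point, for $n = 11$ cases. The ordering of the cases is arbitrary and the helper `corr` is an illustrative name.

```python
import numpy as np

# Assumed design: 3 x 3 grid of levels plus two extra center-point runs (n = 11).
levels_1 = [0.6, 1.0, 1.4]       # charge rate X1, amperes
levels_2 = [10.0, 20.0, 30.0]    # ambient temperature X2, degrees C
runs = [(a, b) for a in levels_1 for b in levels_2]
runs += [(1.0, 20.0), (1.0, 20.0)]
X1, X2 = np.array(runs).T

# Coding (8.15): center at the mean and scale by the spacing between levels.
x1 = (X1 - 1.0) / 0.4
x2 = (X2 - 20.0) / 10.0


def corr(u, v):
    """Coefficient of simple correlation between two arrays."""
    return float(np.corrcoef(u, v)[0, 1])


r_X1 = corr(X1, X1**2)   # original variable vs. its square
r_X2 = corr(X2, X2**2)
r_x1 = corr(x1, x1**2)   # coded variable vs. its square
r_x2 = corr(x2, x2**2)

print(f"X1, X1^2: {r_X1:.3f}   x1, x1^2: {r_x1:.3f}")
print(f"X2, X2^2: {r_X2:.3f}   x2, x2^2: {r_x2:.3f}")
```

The coded correlations vanish because the coded levels are symmetric about zero, so $\sum x_i^3 = 0$; this is a property of the balanced design rather than of centering in general.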

The researcher was particularly interested in whether interaction effects and curvature 
effects are required in the model for the range of the X variables considered. 

Fitting of Model. Figure 8.4 contains the basic regression results for the fit of model (8.13)
with the SAS regression package. Using the estimated regression coefficients (labeled
Parameter Estimate), we see that the estimated regression function is as follows:

$$\hat{Y} = 162.84 - 55.83x_1 + 75.50x_2 + 27.39x_1^2 - 10.61x_2^2 + 11.50x_1 x_2 \tag{8.16}$$
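As a usage sketch, the fitted function (8.16) can be evaluated at settings given in original units by first applying the coding (8.15). The function names `code_levels` and `fitted_cycles` are hypothetical; the coefficients are the rounded values shown in (8.16).

```python
def code_levels(X1, X2):
    """Coding (8.15): charge rate in amperes and temperature in degrees C."""
    return (X1 - 1.0) / 0.4, (X2 - 20.0) / 10.0


def fitted_cycles(X1, X2):
    """Estimated regression function (8.16), taking inputs in original units."""
    x1, x2 = code_levels(X1, X2)
    return (162.84 - 55.83 * x1 + 75.50 * x2
            + 27.39 * x1**2 - 10.61 * x2**2 + 11.50 * x1 * x2)


center = fitted_cycles(1.0, 20.0)   # coded point x1 = 0, x2 = 0
corner = fitted_cycles(0.6, 30.0)   # coded point x1 = -1, x2 = 1
print(f"center: {center:.2f}   low charge rate, high temperature: {corner:.2f}")
```

Both points lie inside the experimental region; evaluating the function outside the studied ranges of charge rate and temperature would be an extrapolation of the kind cautioned against earlier in the section.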

Residual Plots. The researcher first investigated the appropriateness of regression
model (8.13) for the data at hand. Plots of the residuals against $\hat{Y}$, $x_1$, and $x_2$ are shown
in Figure 8.5, as is also a normal probability plot. None of these plots suggest any gross
inadequacies of regression model (8.13). The coefficient of correlation between the ordered
residuals and their expected values under normality is .974, which supports the assumption
of normality of the error terms (see Table B.6).

Test of Fit. Since there are three replications at $x_1 = 0$, $x_2 = 0$, another indication of the
adequacy of regression model (8.13) can be obtained by the formal test in (6.68) of the
goodness of fit of the regression function (8.14). The pure error sum of squares (3.16) is simple
to obtain here, because there is only one combination of levels at which replications occur:

$$SSPE = (157 - 157.33)^2 + (131 - 157.33)^2 + (184 - 157.33)^2 = 1{,}404.67$$





FIGURE 8.4 SAS Regression Output for Second-Order Polynomial Model (8.13): Power Cells Example.

Model: MODEL1
Dependent Variable: Y

                         Analysis of Variance

                        Sum of           Mean
Source      DF         Squares         Square    F Value    Prob>F
Model        5     55365.56140    11073.11228     10.565    0.0109
Error        5      5240.43860     1048.08772
C Total     10     60606.00000

    Root MSE      32.37418     R-square    0.9135
    Dep Mean     172.00000     Adj R-sq    0.8271
    C.V.          18.82220

                         Parameter Estimates

                  Parameter        Standard      T for H0:
Variable    DF     Estimate           Error    Parameter=0    Prob > |T|
INTERCEP     1   162.842105     16.60760542          9.805        0.0002
X1           1   -55.833333     13.21670483         -4.224        0.0083
X2           1    75.500000     13.21670483          5.712        0.0023
X1SQ         1    27.394737     20.34007956          1.347        0.2359
X2SQ         1   -10.605263     20.34007956         -0.521        0.6244
X1X2         1    11.500000     16.18709146          0.710        0.5092

Variable    DF      Type I SS
INTERCEP     1         325424
X1           1          18704
X2           1          34202
X1SQ         1    1645.966667
X2SQ         1     284.928070
X1X2         1     529.000000

Since there are $c = 9$ distinct combinations of levels of the X variables here, there are
$n - c = 11 - 9 = 2$ degrees of freedom associated with SSPE. Further, $SSE = 5{,}240.44$
according to Figure 8.4; hence the lack of fit sum of squares (3.24) is:

$$SSLF = SSE - SSPE = 5{,}240.44 - 1{,}404.67 = 3{,}835.77$$

with which $c - p = 9 - 6 = 3$ degrees of freedom are associated. (Remember that $p = 6$
regression coefficients in model (8.13) had to be estimated.) Hence, test statistic (6.68b) for
testing the adequacy of the regression function (8.14) is:

$$F^* = \frac{SSLF}{c - p} \div \frac{SSPE}{n - c} = \frac{3{,}835.77}{3} \div \frac{1{,}404.67}{2} = 1.82$$

For $\alpha = .05$, we require $F(.95; 3, 2) = 19.2$. Since $F^* = 1.82 \le 19.2$, we conclude
according to decision rule (6.68c) that the second-order polynomial regression function
(8.14) is a good fit.
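The arithmetic of this lack of fit test can be retraced with a short script. The following is only a sketch: the variable names are illustrative, the replicate responses and SSE are the values quoted above, and the critical value 19.2 is copied from the text rather than computed.

```python
# Lack-of-fit F test for the power cells example (values quoted in the text).
replicates = [157, 131, 184]   # responses at the replicated point x1 = 0, x2 = 0
n, c, p = 11, 9, 6             # cases, distinct X combinations, regression coefficients
sse = 5240.44                  # error sum of squares from Figure 8.4

mean_rep = sum(replicates) / len(replicates)
sspe = sum((y - mean_rep) ** 2 for y in replicates)   # pure error sum of squares
sslf = sse - sspe                                     # lack-of-fit sum of squares

f_star = (sslf / (c - p)) / (sspe / (n - c))
f_crit = 19.2                  # F(.95; 3, 2), taken from the text, not computed here

print(f"SSPE = {sspe:.2f}, SSLF = {sslf:.2f}, F* = {f_star:.2f}")
print("conclude good fit" if f_star <= f_crit else "conclude lack of fit")
```

Because only one X combination is replicated, SSPE rests on just 2 degrees of freedom, which is why the critical value is so large.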

Coefficient of Multiple Determination. Figure 8.4 shows that the coefficient of multiple
determination (labeled R-square) is $R^2 = .9135$. Thus, the variation in the lives of the power
cells is reduced by about 91 percent when the first-order and second-order relations to the
charge rate and ambient temperature are utilized. Note that the adjusted coefficient of
multiple determination (labeled Adj R-sq) is $R_a^2 = .8271$. This coefficient is considerably smaller
here than the unadjusted coefficient because of the relatively large number of parameters in
the polynomial regression function with two predictor variables.
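The gap between the two coefficients can be checked from the sums of squares in Figure 8.4. This sketch assumes the usual definition of the adjusted coefficient, $R_a^2 = 1 - \frac{n-1}{n-p}\,\frac{SSE}{SSTO}$; the variable names are illustrative.

```python
# R-square and adjusted R-square from the ANOVA table in Figure 8.4.
n, p = 11, 6                  # cases and regression coefficients in model (8.13)
sse = 5240.43860              # error sum of squares
ssto = 60606.0                # corrected total sum of squares

r2 = 1 - sse / ssto
r2_adj = 1 - (n - 1) / (n - p) * sse / ssto   # penalizes the 6 fitted coefficients

print(f"R-square = {r2:.4f}, Adj R-sq = {r2_adj:.4f}")
```

With only 11 cases and 6 coefficients the multiplier $(n-1)/(n-p)$ equals 2, so the unexplained proportion is doubled in the adjusted measure.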



FIGURE 8.5 Diagnostic Residual Plots: Power Cells Example. Panels: (a) Residual Plot against $\hat{Y}$; (b) Residual Plot against $x_1$; (c) Residual Plot against $x_2$; (d) Normal Probability Plot (residual against expected value).

Partial F Test. The researcher now turned to consider whether a first-order model would
be sufficient. The test alternatives are:

$$H_0\colon \beta_{11} = \beta_{22} = \beta_{12} = 0$$
$$H_a\colon \text{not all } \beta\text{s in } H_0 \text{ equal zero}$$

The partial F test statistic (7.27) here is:

$$F^* = \frac{SSR(x_1^2, x_2^2, x_1x_2 \mid x_1, x_2)}{3} \div MSE$$

In anticipation of this test, the researcher entered the X variables in the SAS regression
program in the order $x_1$, $x_2$, $x_1^2$, $x_2^2$, $x_1x_2$, as may be seen at the bottom of Figure 8.4. The
extra sums of squares are labeled Type I SS. The first sum of squares shown is not relevant
here. The second one is $SSR(x_1) = 18{,}704$, the third one is $SSR(x_2 \mid x_1) = 34{,}202$, and so
on. The required extra sum of squares is therefore obtained as follows:

$$\begin{aligned}
SSR(x_1^2, x_2^2, x_1x_2 \mid x_1, x_2) &= SSR(x_1^2 \mid x_1, x_2) + SSR(x_2^2 \mid x_1, x_2, x_1^2) \\
&\quad + SSR(x_1x_2 \mid x_1, x_2, x_1^2, x_2^2) \\
&= 1{,}646.0 + 284.9 + 529.0 = 2{,}459.9
\end{aligned}$$

We also require the error mean square. We find in Figure 8.4 that it is $MSE = 1{,}048.1$. Hence
the test statistic is:

$$F^* = \frac{2{,}459.9}{3} \div 1{,}048.1 = .78$$

For level of significance $\alpha = .05$, we require $F(.95; 3, 5) = 5.41$. Since $F^* = .78 \le 5.41$,
we conclude $H_0$, that no curvature and interaction effects are needed, so that a first-order
model is adequate for the range of the charge rates and temperatures considered.

FIGURE 8.6 S-Plus Plot of Fitted Response Plane (8.19): Power Cells Example.
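The partial F statistic computed above can be assembled directly from the Type I (sequential) sums of squares. The sketch below uses a hypothetical helper, `partial_f`, and it is valid only because the tested terms were entered last in the fitting order; the critical value is copied from the text.

```python
# Partial F test built from sequential (Type I) sums of squares in Figure 8.4.
type1_ss = {
    "x1": 18704.0,
    "x2": 34202.0,
    "x1sq": 1645.966667,
    "x2sq": 284.928070,
    "x1x2": 529.0,
}
mse = 1048.08772   # error mean square of the full second-order model


def partial_f(tested_terms, seq_ss, mse_full):
    """Return (extra SS, F*) for dropping tested_terms.

    Assumes tested_terms are the LAST terms in the sequential fit, so that
    their Type I sums of squares add up to the required extra sum of squares.
    """
    extra = sum(seq_ss[t] for t in tested_terms)
    return extra, (extra / len(tested_terms)) / mse_full


extra_ss, f_star = partial_f(["x1sq", "x2sq", "x1x2"], type1_ss, mse)
f_crit = 5.41      # F(.95; 3, 5), taken from the text

print(f"extra SS = {extra_ss:.1f}, F* = {f_star:.2f}")
print("conclude H0" if f_star <= f_crit else "conclude Ha")
```

Had the second-order terms been entered first, the same Type I column could not have been summed this way, which is why the entry order was chosen in advance.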

First-Order Model. On the basis of this analysis, the researcher decided to consider the
first-order model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \tag{8.17}$$

A fit of this model yielded the estimated response function:

$$\hat{Y} = 172.00 - \underset{(12.67)}{55.83}\,x_1 + \underset{(12.67)}{75.50}\,x_2 \tag{8.18}$$

Note that the regression coefficients $b_1$ and $b_2$ are the same as in (8.16) for the fitted
second-order model. This is a result of the choices of the $X_1$ and $X_2$ levels studied. The
numbers in parentheses under the estimated regression coefficients are their estimated standard
deviations. A variety of residual plots for this first-order model were made and analyzed
by the researcher (not shown here), which confirmed the appropriateness of first-order
model (8.17).

Fitted First-Order Model in Terms of X. The fitted first-order regression function (8.18) 
can be transformed back to the original variables by utilizing (8.15). We obtain: 


$$\hat{Y} = 160.58 - 139.58X_1 + 7.55X_2 \tag{8.19}$$


Figure 8.6 contains an S-Plus regression-scatter plot of the fitted response plane. The 
researcher used this fitted response surface for investigating the effects of charge rate and 
temperature on the life of this new type of power cell. 






Estimation of Regression Coefficients. The researcher wished to estimate the linear
effects of the two predictor variables in the first-order model, with a 90 percent family
confidence coefficient, by means of the Bonferroni method. Here, $g = 2$ statements are
desired; hence, by (6.52a), we have:

$$B = t[1 - .10/2(2); 8] = t(.975; 8) = 2.306$$

The estimated standard deviations of $b_1$ and $b_2$ in (8.18) apply to the model in the coded
variables. Since only first-order terms are involved in this fitted model, we obtain the estimated
standard deviations of $b_1'$ and $b_2'$ for the fitted model (8.19) in the original variables as follows:

$$s\{b_1'\} = \frac{1}{.4}\,s\{b_1\} = \frac{12.67}{.4} = 31.68$$

$$s\{b_2'\} = \frac{1}{10}\,s\{b_2\} = \frac{12.67}{10} = 1.267$$

The Bonferroni confidence limits by (6.52) therefore are $-139.58 \pm 2.306(31.68)$ and
$7.55 \pm 2.306(1.267)$, yielding the confidence limits:

$$-212.6 \le \beta_1 \le -66.5 \qquad 4.6 \le \beta_2 \le 10.5$$

With confidence .90, we conclude that the mean number of charge/discharge cycles before
failure decreases by 66 to 213 cycles with a unit increase in the charge rate for given ambient
temperature, and increases by 5 to 10 cycles with a unit increase of ambient temperature
for given charge rate. The researcher was satisfied with the precision of these estimates for
this initial small-scale study.
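The rescaling and the Bonferroni limits can be retraced numerically. This is a sketch under stated assumptions: the scale factors .4 and 10 are inferred from the ratios $31.68 = 12.67/.4$ and $1.267 = 12.67/10$ (the coding definition (8.15) is not repeated here), the multiplier 2.306 is copied from the text, and `bonferroni_limits` is a hypothetical helper name.

```python
# Back-transform coded-variable slopes and form Bonferroni limits.
b1_coded, b2_coded = -55.83, 75.50   # slopes from fitted model (8.18)
s_coded = 12.67                      # estimated standard deviation of each coded slope
scale_x1, scale_x2 = 0.4, 10.0       # ASSUMED coding scale factors (see lead-in)
B = 2.306                            # t(.975; 8), taken from the text


def bonferroni_limits(estimate, std_dev, multiplier):
    """Return (lower, upper) limits: estimate -/+ multiplier * std_dev."""
    half_width = multiplier * std_dev
    return estimate - half_width, estimate + half_width


# A first-order slope in the original units is the coded slope divided by the scale;
# its standard deviation rescales by the same factor.
b1_orig, s_b1_orig = b1_coded / scale_x1, s_coded / scale_x1
b2_orig, s_b2_orig = b2_coded / scale_x2, s_coded / scale_x2

lo1, hi1 = bonferroni_limits(b1_orig, s_b1_orig, B)
lo2, hi2 = bonferroni_limits(b2_orig, s_b2_orig, B)

print(f"{lo1:.1f} <= beta1 <= {hi1:.1f}")
print(f"{lo2:.1f} <= beta2 <= {hi2:.1f}")
```

The simple division works only because the fitted model contains first-order terms; with squared or cross-product terms the back-transformation mixes coefficients.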

Some Further Comments on Polynomial Regression 

1. The use of polynomial models is not without drawbacks. Such models can be more 
expensive in degrees of freedom than alternative nonlinear models or linear models with 
transformed variables. Another potential drawback is that serious multicollinearity may be 
present even when the predictor variables are centered. 

2. An alternative to using centered variables in polynomial regression is to use
orthogonal polynomials. Orthogonal polynomials are uncorrelated. Some computer packages use
orthogonal polynomials in their polynomial regression routines and present the final fitted
results in terms of both the orthogonal polynomials and the original polynomials.
Orthogonal polynomials are discussed in specialized texts such as Reference 8.1.

3. Sometimes a quadratic response function is fitted for the purpose of establishing the
linearity of the response function when repeat observations are not available for directly
testing the linearity of the response function. Fitting the quadratic model:

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i \tag{8.20}$$

and testing whether $\beta_{11} = 0$ does not, however, necessarily establish that a linear response
function is appropriate. Figure 8.2a provides an example. If sample data were obtained for
the response function in Figure 8.2a, model (8.20) fitted, and a test on $\beta_{11}$ made, it likely
would lead to the conclusion that $\beta_{11} = 0$. Yet a linear response function clearly might not
be appropriate. Examination of residuals would disclose this lack of fit and should always
accompany formal testing of polynomial regression coefficients.





8.2 Interaction Regression Models 


We have previously noted that regression models with cross-product interaction effects,
such as regression model (6.15), are special cases of general linear regression model (6.7).
We also encountered regression models with interaction effects briefly when we considered
polynomial regression models, such as model (8.7). Now we consider in some detail
regression models with interaction effects, including their interpretation and implementation.


Interaction Effects 

A regression model with $p - 1$ predictor variables contains additive effects if the response
function can be written in the form:

$$E\{Y\} = f_1(X_1) + f_2(X_2) + \cdots + f_{p-1}(X_{p-1}) \tag{8.21}$$

where $f_1, f_2, \ldots, f_{p-1}$ can be any functions, not necessarily simple ones. For instance,
the following response function with two predictor variables can be expressed in the form
of (8.21):

$$E\{Y\} = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_1^2}_{f_1(X_1)} + \underbrace{\beta_3 X_2}_{f_2(X_2)}$$

We say here that the effects of $X_1$ and $X_2$ on $Y$ are additive.

In contrast, the following regression function:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$

cannot be expressed in the form (8.21). Hence, this latter regression model is not additive,
or, equivalently, it contains an interaction effect.

A simple and commonly used means of modeling the interaction effect of two predictor
variables on the response variable is by a cross-product term, such as $\beta_3 X_1 X_2$ in the above
response function. The cross-product term is called an interaction term. More specifically,
it is sometimes called a linear-by-linear or a bilinear interaction term. When there are three
predictor variables whose effects on the response variable are linear, but the effects on $Y$ of
$X_1$ and $X_2$ and of $X_1$ and $X_3$ are interacting, the response function would be modeled as
follows using cross-product terms:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_2 + \beta_5 X_1 X_3$$


Interpretation of Interaction Regression Models with Linear Effects 

We shall explain the influence of interaction effects on the shape of the response function 
and on the interpretation of the regression coefficients by first considering the simple case of 
two quantitative predictor variables where each has a linear effect on the response variable. 

Interpretation of Regression Coefficients. The regression model for two quantitative
predictor variables with linear effects on $Y$ and interacting effects of $X_1$ and $X_2$ on $Y$
represented by a cross-product term is as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \varepsilon_i \tag{8.22}$$

The meaning of the regression coefficients $\beta_1$ and $\beta_2$ here is not the same as that given earlier
because of the interaction term $\beta_3 X_{i1} X_{i2}$. The regression coefficients $\beta_1$ and $\beta_2$ no longer
indicate the change in the mean response with a unit increase of the predictor variable, with
the other predictor variable held constant at any given level. It can be shown that the change
in the mean response with a unit increase in $X_1$ when $X_2$ is held constant is:

$$\beta_1 + \beta_3 X_2 \tag{8.23}$$

Similarly, the change in the mean response with a unit increase in $X_2$ when $X_1$ is held
constant is:

$$\beta_2 + \beta_3 X_1 \tag{8.24}$$

Hence, in regression model (8.22) both the effect of $X_1$ for given level of $X_2$ and the effect
of $X_2$ for given level of $X_1$ depend on the level of the other predictor variable.

We shall illustrate how the effect of one predictor variable depends on the level of the
other predictor variable in regression model (8.22) by returning to the sales promotion
response function shown in Figure 6.1 on page 215. The response function (6.3) for this
example, relating locality sales ($Y$) to point-of-sale expenditures ($X_1$) and TV expenditures
($X_2$), is additive:

$$E\{Y\} = 10 + 2X_1 + 5X_2 \tag{8.25}$$

In Figure 8.7a, we show the response function $E\{Y\}$ as a function of $X_1$ when $X_2 = 1$
and when $X_2 = 3$. Note that the two response functions are parallel; that is, the mean
sales response increases by the same amount $\beta_1 = 2$ with a unit increase of point-of-sale
expenditures whether TV expenditures are $X_2 = 1$ or $X_2 = 3$. The plot in Figure 8.7a is
called a conditional effects plot because it shows the effects of $X_1$ on the mean response
conditional on different levels of the other predictor variable.

In Figure 8.7b, we consider the same response function but with the cross-product term
$.5X_1X_2$ added for interaction effect of the two types of promotional expenditures on sales:

$$E\{Y\} = 10 + 2X_1 + 5X_2 + .5X_1X_2 \tag{8.26}$$


FIGURE 8.7 Illustration of Reinforcement and Interference Interaction Effects: Sales Promotion Example. Panels: (a) Additive Model; (b) Reinforcement Interaction Effect; (c) Interference Interaction Effect.






We again use a conditional effects plot to show the response function $E\{Y\}$ as a function
of $X_1$ conditional on $X_2 = 1$ and on $X_2 = 3$. Note that the slopes of the response functions
plotted against $X_1$ now differ for $X_2 = 1$ and $X_2 = 3$. The slope of the response function
when $X_2 = 1$ is by (8.23):

$$\beta_1 + \beta_3 X_2 = 2 + .5(1) = 2.5$$

and when $X_2 = 3$, the slope is:

$$\beta_1 + \beta_3 X_2 = 2 + .5(3) = 3.5$$

Thus, a unit increase in point-of-sale expenditures has a larger effect on sales when TV
expenditures are at a higher level than when they are at a lower level.
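The conditional slopes just computed can be verified by evaluating the response function directly. The sketch below uses illustrative function names; the coefficients are those of (8.26), and the negative value of $\beta_3$ in the last lines simply flips the sign of the cross-product coefficient for comparison.

```python
# Conditional effect of X1 in the sales promotion response function.
def mean_response(x1, x2, b0=10.0, b1=2.0, b2=5.0, b3=0.5):
    """E{Y} = b0 + b1*X1 + b2*X2 + b3*X1*X2, with defaults from (8.26)."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def slope_in_x1(x2, b1=2.0, b3=0.5):
    """Change in E{Y} per unit increase in X1 with X2 held fixed, as in (8.23)."""
    return b1 + b3 * x2


for x2 in (1, 3):
    # A unit step in X1 (from 4 to 5, an arbitrary choice) reproduces the slope.
    step = mean_response(5, x2) - mean_response(4, x2)
    print(f"X2 = {x2}: slope = {slope_in_x1(x2)}, unit-step change = {step}")

# Flipping the sign of the cross-product coefficient makes the slope shrink with X2.
print(slope_in_x1(1, b3=-0.5), slope_in_x1(3, b3=-0.5))
```

The unit-step change equals the slope for any starting value of $X_1$, since the response is linear in $X_1$ once $X_2$ is fixed.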

Hence, $\beta_1$ in regression model (8.22) containing a cross-product term for interaction
effect no longer indicates the change in the mean response for a unit increase in $X_1$ for any
given $X_2$ level. That effect in this model depends on the level of $X_2$. Although the mean
response in regression model (8.22) when $X_2$ is constant is still a linear function of $X_1$, now
both the intercept and the slope of the response function change as the level at which $X_2$ is
held constant is varied. The same holds when the mean response is regarded as a function
of $X_2$, with $X_1$ constant.

Note that as a result of the interaction effect in regression model (8.26), the increase
in sales with a unit increase in point-of-sale expenditures is greater, the higher the level
of TV expenditures, as shown by the larger slope of the response function when $X_2 = 3$
than when $X_2 = 1$. A similar increase in the slope occurs if the response function against
$X_2$ is considered for higher levels of $X_1$. When the regression coefficients $\beta_1$ and $\beta_2$ are
positive, we say that the interaction effect between the two quantitative variables is of a
reinforcement or synergistic type when the slope of the response function against one of the
predictor variables increases for higher levels of the other predictor variable (i.e., when $\beta_3$
is positive).

If the sign of $\beta_3$ in regression model (8.26) were negative:

$$E\{Y\} = 10 + 2X_1 + 5X_2 - .5X_1X_2 \tag{8.27}$$

the result of the interaction effect of the two types of promotional expenditures on sales
would be that the increase in sales with a unit increase in point-of-sale expenditures becomes
smaller, the higher the level of TV expenditures. This effect is shown in the conditional
effects plot in Figure 8.7c. The two response functions for $X_2 = 1$ and $X_2 = 3$ are again
nonparallel, but now the slope of the response function is smaller for the higher level of
TV expenditures. A similar decrease in the slope would occur if the response function
against $X_2$ is considered for higher levels of $X_1$. When the regression coefficients $\beta_1$ and
$\beta_2$ are positive, we say that the interaction effect between two quantitative variables is of
an interference or antagonistic type when the slope of the response function against one of
the predictor variables decreases for higher levels of the other predictor variable (i.e., when
$\beta_3$ is negative).

Comments

1. When the signs of $\beta_1$ and $\beta_2$ in regression model (8.22) are negative, a negative $\beta_3$ is usually
viewed as a reinforcement type of interaction effect and a positive $\beta_3$ as an interference type of effect.

2. To derive (8.23) and (8.24), we differentiate:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$

with respect to $X_1$ and $X_2$, respectively:

$$\frac{\partial E\{Y\}}{\partial X_1} = \beta_1 + \beta_3 X_2 \qquad \frac{\partial E\{Y\}}{\partial X_2} = \beta_2 + \beta_3 X_1$$

■

Shape of Response Function. Figure 8.8 shows for the sales promotion example the 
impact of the interaction effect on the shape of the response function. Figure 8.8a presents the 
additive response function in (8.25), and Figures 8.8b and 8.8c present the response functions 
with the reinforcement interaction effect in (8.26) and with the interference interaction effect 
in (8.27), respectively. Note that the additive response function is a plane, but that the two 
response functions with interaction effects are not. Also note in Figures 8.8b and 8.8c that 
the mean response as a function of Xi, for any given level of X 2 , is no longer parallel to the 
same function at a different level of X 2 , for either type of interaction effect. 

We can also illustrate the difference in the shape of the response function when the 
two predictor variables do and do not interact by representing the response surface by 
means of a contour diagram. As we noted previously, such a diagram shows for different 
response levels the various combinations of levels of the two predictor variables that yield 
the same level of response. Figure 8.8d shows a contour diagram for the additive response 
surface in Figure 8.8a when the two predictor variables do not interact. Note that the contour 
curves are straight lines and that the contour lines are parallel and hence equally spaced. 
Figures 8.8e and 8.8f show contour diagrams for the response surfaces in Figures 8.8b 
and 8.8c, respectively, where the two predictor variables interact. Note that the contour 
curves are no longer straight lines and that the contour curves are not parallel here. For 
instance, in Figure 8.8e the vertical distance between the contours for $E\{Y\} = 200$ and
$E\{Y\} = 400$ at $X_1 = 10$ is much larger than at $X_1 = 50$.

In general, additive or noninteracting predictor variables lead to parallel contour curves, 
whereas interacting predictor variables lead to nonparallel contour curves. 


Interpretation of Interaction Regression Models with Curvilinear Effects 

When one or more of the predictor variables in a regression model have curvilinear effects
on the response variable, the presence of interaction effects again leads to response functions
whose contour curves are not parallel. Figure 8.9a shows the response surface for a study
of the volume of a quick bread:

$$E\{Y\} = 65 + 3X_1 + 4X_2 - 10X_1^2 - 15X_2^2 + 35X_1X_2$$

Here, $Y$ is the percentage increase in the volume of the quick bread from baking, $X_1$ is the
amount of a leavening agent (coded), and $X_2$ is the oven temperature (coded). Figure 8.9b
shows contour curves for this response function. Note the lack of parallelism in the contour
curves, reflecting the interaction effect. Figure 8.10 presents a conditional effects plot to
show in a simple fashion the nature of the interaction in the relation of oven temperature ($X_2$)
to the mean volume when leavening agent amount ($X_1$) is held constant at different levels.
Note that increasing oven temperature increases volume when leavening agent amount is
high, and the opposite is true when leavening agent amount is low.
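The reversal of the temperature effect can be seen by differentiating the response function with respect to $X_2$, which gives $4 - 30X_2 + 35X_1$. The sketch below evaluates this at coded leavening levels of $-1$ and $+1$; these levels are an assumption made for illustration, since the levels actually plotted in Figure 8.10 are not reproduced here.

```python
# Conditional effect of oven temperature (X2) in the quick bread response function.
def mean_volume(x1, x2):
    """E{Y} = 65 + 3*X1 + 4*X2 - 10*X1^2 - 15*X2^2 + 35*X1*X2 (coded units)."""
    return 65 + 3 * x1 + 4 * x2 - 10 * x1 ** 2 - 15 * x2 ** 2 + 35 * x1 * x2


def temperature_effect(x1, x2):
    """Partial derivative of E{Y} with respect to X2: 4 - 30*X2 + 35*X1."""
    return 4 - 30 * x2 + 35 * x1


# ASSUMED illustrative levels: low leavening = -1, high leavening = +1 (coded).
for x1 in (-1, 1):
    low_temp, high_temp = mean_volume(x1, -1), mean_volume(x1, 1)
    print(f"X1 = {x1:+d}: E{{Y}} at X2=-1 is {low_temp}, at X2=+1 is {high_temp}, "
          f"slope at X2=0 is {temperature_effect(x1, 0)}")
```

At the high leavening level the mean volume rises as temperature goes from the low to the high coded level, and at the low leavening level it falls, matching the description above.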



FIGURE 8.9 Response Surface and Contour Curves for Curvilinear Regression Model with Interaction Effect: Quick Bread Volume Example. Panels: (a) Fitted Response Surface; (b) Contour Plot.

FIGURE 8.10 Conditional Effects Plot for Curvilinear Regression Model with Interaction Effect: Quick Bread Volume Example.


Implementation of Interaction Regression Models 

The fitting of interaction regression models is routine, once the appropriate cross-product
terms have been added to the data set. Two considerations need to be kept in mind when
developing regression models with interaction effects.

1. When interaction terms are added to a regression model, high multicollinearities may
exist between some of the predictor variables and some of the interaction terms, as well as
among some of the interaction terms. A partial remedy to improve computational accuracy
is to center the predictor variables; i.e., to use $x_{ik} = X_{ik} - \bar{X}_k$.

2. When the number of predictor variables in the regression model is large, the potential
number of interaction terms can become very large. For example, if eight predictor
variables are present in the regression model in linear terms, there are potentially 28
pairwise interaction terms that could be added to the regression model. The data set would need
to be quite large before 36 X variables could be used in the regression model.
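The counts in the second consideration follow from choosing 2 of the 8 predictors. A minimal sketch, with illustrative variable names:

```python
from math import comb

n_predictors = 8
n_pairwise = comb(n_predictors, 2)      # number of distinct cross-product pairs
n_x_variables = n_predictors + n_pairwise

print(n_pairwise, n_x_variables)        # pairwise interaction terms, total X variables
```

The number of pairwise terms grows as $k(k-1)/2$, so doubling the number of predictors roughly quadruples the number of candidate interaction terms.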


It is therefore desirable to identify in advance, whenever possible, those interactions
that are most likely to influence the response variable in important ways. In addition to
utilizing a priori knowledge, one can plot the residuals for the additive regression model
against the different interaction terms to determine which ones appear to be influential
in affecting the response variable. When the number of predictor variables is large, these
plots may need to be limited to interaction terms involving those predictor variables that
appear to be the most important on the basis of the initial fit of the additive regression
model.


Example 


We wish to test formally in the body fat example of Table 7.1 whether interaction terms
between the three predictor variables should be included in the regression model. We therefore
need to consider the following regression model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i1}X_{i2} + \beta_5 X_{i1}X_{i3} + \beta_6 X_{i2}X_{i3} + \varepsilon_i \tag{8.28}$$

This regression model requires that we obtain the new variables $X_1X_2$, $X_1X_3$, and $X_2X_3$
and add these X variables to the ones in Table 7.1. We find upon examining these X variables
that some of the predictor variables are highly correlated with some of the interaction
terms, and that there are also some high correlations among the interaction terms. For
example, the correlation between $X_1$ and $X_1X_2$ is .989 and that between $X_1X_3$ and $X_2X_3$
is .998.

We shall therefore use centered variables in the regression model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}x_{i2} + \beta_5 x_{i1}x_{i3} + \beta_6 x_{i2}x_{i3} + \varepsilon_i \tag{8.29}$$

where:

$$\begin{aligned}
x_{i1} &= X_{i1} - \bar{X}_1 = X_{i1} - 25.305 \\
x_{i2} &= X_{i2} - \bar{X}_2 = X_{i2} - 51.170 \\
x_{i3} &= X_{i3} - \bar{X}_3 = X_{i3} - 27.620
\end{aligned}$$

Upon obtaining the cross-product terms using the centered variables, we find that the
intercorrelations involving the cross-product terms are now smaller. For example, the largest
correlation, which was between $X_1X_3$ and $X_2X_3$, is reduced from .998 to .891. Other
correlations are reduced in absolute magnitude even more.

Fitting regression model (8.29) yields the following estimated regression function, mean
square error, and extra sums of squares:

$$\hat{Y} = 20.53 + 3.438x_1 - 2.095x_2 - 1.616x_3 + .00888x_1x_2 - .08479x_1x_3 + .09042x_2x_3$$

$$MSE = 6.745$$



Variable     Extra Sum of Squares

$x_1$        $SSR(x_1) = 352.270$
$x_2$        $SSR(x_2 \mid x_1) = 33.169$
$x_3$        $SSR(x_3 \mid x_1, x_2) = 11.546$
$x_1x_2$     $SSR(x_1x_2 \mid x_1, x_2, x_3) = 1.496$
$x_1x_3$     $SSR(x_1x_3 \mid x_1, x_2, x_3, x_1x_2) = 2.704$
$x_2x_3$     $SSR(x_2x_3 \mid x_1, x_2, x_3, x_1x_2, x_1x_3) = 6.515$


We wish to test whether any interaction terms are needed:

$$H_0\colon \beta_4 = \beta_5 = \beta_6 = 0$$
$$H_a\colon \text{not all } \beta\text{s in } H_0 \text{ equal zero}$$

The partial F test statistic (7.27) requires here the following extra sum of squares:

$$SSR(x_1x_2, x_1x_3, x_2x_3 \mid x_1, x_2, x_3) = 1.496 + 2.704 + 6.515 = 10.715$$

and the test statistic is:

$$F^* = \frac{SSR(x_1x_2, x_1x_3, x_2x_3 \mid x_1, x_2, x_3)}{3} \div MSE = \frac{10.715}{3} \div 6.745 = .53$$

For level of significance $\alpha = .05$, we require $F(.95; 3, 13) = 3.41$. Since $F^* = .53 \le 3.41$,
we conclude $H_0$, that the interaction terms are not needed in the regression model. The
P-value of this test is .67.


8.3 Qualitative Predictors 


As mentioned in Chapter 6, qualitative, as well as quantitative, predictor variables can be 
used in regression models. Many predictor variables of interest in business, economics, 
and the social and biological sciences are qualitative. Examples of qualitative predictor 
variables are gender (male, female), purchase status (purchase, no purchase), and disability 
status (not disabled, partly disabled, fully disabled). 

In a study of innovation in the insurance industry, an economist wished to relate the speed
with which a particular insurance innovation is adopted ($Y$) to the size of the insurance firm
($X_1$) and the type of firm. The response variable is measured by the number of months
elapsed between the time the first firm adopted the innovation and the time the given firm
adopted the innovation. The first predictor variable, size of firm, is quantitative, and is
measured by the amount of total assets of the firm. The second predictor variable, type of
firm, is qualitative and is composed of two classes: stock companies and mutual companies.
In order that such a qualitative variable can be used in a regression model, quantitative
indicators for the classes of the qualitative variable must be employed.






Qualitative Predictor with Two Classes 

There are many ways of quantitatively identifying the classes of a qualitative variable. We
shall use indicator variables that take on the values 0 and 1. These indicator variables are
easy to use and are widely employed, but they are by no means the only way to quantify a
qualitative variable.

For the insurance innovation example, where the qualitative predictor variable has two
classes, we might define two indicator variables $X_2$ and $X_3$ as follows:

$$X_2 = \begin{cases} 1 & \text{if stock company} \\ 0 & \text{otherwise} \end{cases} \qquad
X_3 = \begin{cases} 1 & \text{if mutual company} \\ 0 & \text{otherwise} \end{cases} \tag{8.30}$$

A first-order model then would be the following:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \tag{8.31}$$


This intuitive approach of setting up an indicator variable for each class of the qualitative
predictor variable unfortunately leads to computational difficulties. To see why, suppose
we have $n = 4$ observations, the first two being stock firms (for which $X_2 = 1$ and $X_3 = 0$),
and the second two being mutual firms (for which $X_2 = 0$ and $X_3 = 1$). The X matrix would
then be (columns: the constant 1, $X_1$, $X_2$, $X_3$):

$$\mathbf{X} = \begin{bmatrix}
1 & X_{11} & 1 & 0 \\
1 & X_{21} & 1 & 0 \\
1 & X_{31} & 0 & 1 \\
1 & X_{41} & 0 & 1
\end{bmatrix}$$

Note that the first column is equal to the sum of the $X_2$ and $X_3$ columns, so that the columns
are linearly dependent according to definition (5.20). This has a serious effect on the $\mathbf{X}'\mathbf{X}$
matrix:

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix}
4 & \sum_{i=1}^{4} X_{i1} & 2 & 2 \\
\sum_{i=1}^{4} X_{i1} & \sum_{i=1}^{4} X_{i1}^2 & \sum_{i=1}^{2} X_{i1} & \sum_{i=3}^{4} X_{i1} \\
2 & \sum_{i=1}^{2} X_{i1} & 2 & 0 \\
2 & \sum_{i=3}^{4} X_{i1} & 0 & 2
\end{bmatrix}$$

We see that the first column of the $\mathbf{X}'\mathbf{X}$ matrix equals the sum of the last two columns,
so that the columns are linearly dependent. Hence, the $\mathbf{X}'\mathbf{X}$ matrix does not have an inverse,
and no unique estimators of the regression coefficients can be found.
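The linear dependence can be confirmed numerically for the four-observation layout just described. In this sketch the firm sizes are hypothetical values chosen only for illustration, and the helper names `gram` and `det` are not from any library; integer arithmetic is used so that the zero determinant is exact.

```python
# Singular X'X when both indicator variables are kept alongside the intercept.
sizes = [100, 200, 150, 300]   # HYPOTHETICAL firm sizes; any values show the same result

# Columns: constant, X1 (size), X2 (stock indicator), X3 (mutual indicator).
X_full = [[1, sizes[0], 1, 0],
          [1, sizes[1], 1, 0],
          [1, sizes[2], 0, 1],
          [1, sizes[3], 0, 1]]
X_reduced = [row[:3] for row in X_full]   # drop the X3 column


def gram(X):
    """Return X'X as a list of lists."""
    k = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]


def det(m):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


det_full = det(gram(X_full))
det_reduced = det(gram(X_reduced))
print(det_full, det_reduced)   # a zero determinant means X'X has no inverse
```

Dropping one indicator column removes the dependence, so the reduced matrix has a nonzero determinant and the normal equations have a unique solution.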

A simple way out of this difficulty is to drop one of the indicator variables. In our
example, we might drop $X_3$. Dropping one indicator variable is not the only way out of the
difficulty, but it leads to simple interpretations of the parameters. In general, therefore, we
shall follow the principle:

A qualitative variable with $c$ classes will be represented by $c - 1$
indicator variables, each taking on the values 0 and 1. (8.32)

Comment

Indicator variables are frequently also called dummy variables or binary variables. The latter term
has reference to the binary number system containing only 0 and 1. ■

Interpretation of Regression Coefficients 

Returning to the insurance innovation example, suppose that we drop the indicator variable
$X_3$ from regression model (8.31) so that the model becomes:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{8.33}$$

where:

$$X_{i1} = \text{size of firm} \qquad
X_{i2} = \begin{cases} 1 & \text{if stock company} \\ 0 & \text{if mutual company} \end{cases}$$

The response function for this regression model is:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{8.34}$$

To understand the meaning of the regression coefficients in this model, consider first the
case of a mutual firm. For such a firm, $X_2 = 0$ and response function (8.34) becomes:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2(0) = \beta_0 + \beta_1 X_1 \qquad \text{Mutual firms} \tag{8.34a}$$

Thus, the response function for mutual firms is a straight line, with $Y$ intercept $\beta_0$ and slope
$\beta_1$. This response function is shown in Figure 8.11.

For a stock firm, $X_2 = 1$ and response function (8.34) becomes:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2(1) = (\beta_0 + \beta_2) + \beta_1 X_1 \qquad \text{Stock firms} \tag{8.34b}$$

This also is a straight line, with the same slope $\beta_1$ but with $Y$ intercept $\beta_0 + \beta_2$. This response
function is also shown in Figure 8.11.

Let us consider now the meaning of the regression coefficients in response function (8.34)
with specific reference to the insurance innovation example. We see that the mean time
elapsed before the innovation is adopted, $E\{Y\}$, is a linear function of size of firm ($X_1$),
with the same slope $\beta_1$ for both types of firms. $\beta_2$ indicates how much higher (lower) the
response function for stock firms is than the one for mutual firms, for any given size of firm.
Thus, $\beta_2$ measures the differential effect of type of firm. In general, $\beta_2$ shows how much
higher (lower) the mean response line is for the class coded 1 than the line for the class
coded 0, for any given level of $X_1$.



FIGURE 8.11 Illustration of Meaning of Regression Coefficients for Regression Model (8.33) with Indicator Variable $X_2$: Insurance Innovation Example. Axes: Size of Firm (horizontal); Number of Months Elapsed (vertical).

Example


In the insurance innovation example, the economist studied 10 mutual firms and 10 stock 
firms. The basic data are shown in Table 8.2, columns 1—3. The indicator coding for type 
of firm is shown in column 4. Note that X 2 = 1 for each stock firm and X 2 = 0 for each 
mutual firm. 

The fitting of regression model (8.33) is now straightforward. Table 8.3 presents the key 
results from a computer run regressing F on Xj and X 2 . The fitted response function is: 

Y = 33.87407 - .10174 幻十 8.05547X 2 

Figure 8.12 contains the fitted response function for each type of firm, together with the 
actual observations. 

The economist was most interested in the effect of type of firm ($X_2$) on the elapsed time
for the innovation to be adopted and wished to obtain a 95 percent confidence interval for
$\beta_2$. We require $t(.975; 17) = 2.110$ and obtain from the results in Table 8.3 the confidence
limits $8.05547 \pm 2.110(1.45911)$. The confidence interval for $\beta_2$ therefore is:

$$4.98 \le \beta_2 \le 11.13$$


Thus, with 95 percent confidence, we conclude that stock companies tend to adopt the
innovation somewhere between 5 and 11 months later, on the average, than mutual companies,
for any given size of firm.
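
The interval arithmetic can be reproduced in a few lines. This is only a sketch of the calculation above; the variable names are illustrative, and the critical value is the tabled $t(.975; 17)$ quoted in the text.

```python
# Estimates for beta_2 from Table 8.3 and the tabled critical value t(.975; 17).
b2 = 8.05547
s_b2 = 1.45911
t_crit = 2.110

half_width = t_crit * s_b2
lower = b2 - half_width
upper = b2 + half_width
print(f"{lower:.2f} <= beta_2 <= {upper:.2f}")
```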

A formal test of:

$$H_0\colon \beta_2 = 0$$
$$H_a\colon \beta_2 \ne 0$$



Chapter 8 Regression Models for Quantitative and Qualitative Predictors 317 


TABLE 8.2 Data and Indicator Coding: Insurance Innovation Example.

        (1)               (2)                 (3)       (4)          (5)
        Number of         Size of Firm        Type of   Indicator
Firm    Months Elapsed    (million dollars)   Firm      Code
$i$     $Y_i$             $X_{i1}$                      $X_{i2}$     $X_{i1}X_{i2}$
 1      17                151                 Mutual    0            0
 2      26                 92                 Mutual    0            0
 3      21                175                 Mutual    0            0
 4      30                 31                 Mutual    0            0
 5      22                104                 Mutual    0            0
 6       0                277                 Mutual    0            0
 7      12                210                 Mutual    0            0
 8      19                120                 Mutual    0            0
 9       4                290                 Mutual    0            0
10      16                238                 Mutual    0            0
11      28                164                 Stock     1            164
12      15                272                 Stock     1            272
13      11                295                 Stock     1            295
14      38                 68                 Stock     1            68
15      31                 85                 Stock     1            85
16      21                224                 Stock     1            224
17      20                166                 Stock     1            166
18      13                305                 Stock     1            305
19      30                124                 Stock     1            124
20      14                246                 Stock     1            246


TABLE 8.3 Regression Results for Fit of Regression Model (8.33): Insurance Innovation Example.

(a) Regression Coefficients

Regression     Estimated                  Estimated
Coefficient    Regression Coefficient     Standard Deviation     $t^*$
$\beta_0$      33.87407                   1.81386                18.68
$\beta_1$      -.10174                    .00889                 -11.44
$\beta_2$      8.05547                    1.45911                5.52

(b) Analysis of Variance

Source of Variation     SS          df     MS
Regression              1,504.41     2     752.20
Error                     176.39    17      10.38
Total                   1,680.80    19


with level of significance .05 would lead to $H_a$, that type of firm has an effect, since the
95 percent confidence interval for $\beta_2$ does not include zero.

The economist also carried out other analyses, some of which will be described shortly. 

Comment 

The reader may wonder why we did not simply fit separate regressions for stock firms and mutual 
firms in our example, and instead adopted the approach of fitting one regression with an indicator 







FIGURE 8.12 Fitted Regression Functions for Regression Model (8.33): Insurance Innovation Example. [Plot of Number of Months Elapsed against Size of Firm ($X_1$), with separate symbols for stock firms and mutual firms and the two fitted lines below.]

Stock Firms Response Function: $\hat{Y} = (33.87407 + 8.05547) - .10174X_1$

Mutual Firms Response Function: $\hat{Y} = 33.87407 - .10174X_1$


variable. There are two reasons for this. Since the model assumes equal slopes and the same constant
error term variance for each type of firm, the common slope $\beta_1$ can best be estimated by pooling
the two types of firms. Also, other inferences, such as for $\beta_0$ and $\beta_2$, can be made more precisely by
working with one regression model containing an indicator variable since more degrees of freedom
will then be associated with MSE. ■


Qualitative Predictor with More than Two Classes 

If a qualitative predictor variable has more than two classes, we require additional indicator
variables in the regression model. Consider the regression of tool wear ($Y$) on tool speed
($X_1$) and tool model, where the latter is a qualitative variable with four classes (M1, M2,
M3, M4). We therefore require three indicator variables. Let us define them as follows:

$$X_2 = \begin{cases} 1 & \text{if tool model M1} \\ 0 & \text{otherwise} \end{cases}$$

$$X_3 = \begin{cases} 1 & \text{if tool model M2} \\ 0 & \text{otherwise} \end{cases} \tag{8.35}$$

$$X_4 = \begin{cases} 1 & \text{if tool model M3} \\ 0 & \text{otherwise} \end{cases}$$

First-Order Model. A first-order regression model is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i \tag{8.36}$$





For this model, the data input for the X variables would be as follows:

Tool Model    $X_1$       $X_2$    $X_3$    $X_4$
M1            $X_{i1}$    1        0        0
M2            $X_{i1}$    0        1        0
M3            $X_{i1}$    0        0        1
M4            $X_{i1}$    0        0        0
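
The coding in this table can be generated mechanically. The following is a minimal sketch, assuming a hypothetical helper `indicator_code` that returns the $c - 1$ indicator values for a class and omits the reference class (here M4).

```python
def indicator_code(level, levels, reference):
    """Return the c - 1 zero/one indicators for `level`, omitting the reference class."""
    if level not in levels:
        raise ValueError(f"unknown class: {level!r}")
    return [1 if level == other else 0 for other in levels if other != reference]


tool_models = ["M1", "M2", "M3", "M4"]
for model in tool_models:
    # Columns correspond to X2, X3, X4 in definition (8.35).
    print(model, indicator_code(model, tool_models, reference="M4"))
```

Choosing a different reference class changes which response function the coefficients are compared against, not the fitted response functions themselves.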


The response function for regression model (8.36) is:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 \tag{8.37}$$

To understand the meaning of the regression coefficients, consider first what response
function (8.37) becomes for tool models M4, for which $X_2 = 0$, $X_3 = 0$, and $X_4 = 0$:

$$E\{Y\} = \beta_0 + \beta_1 X_1 \qquad \text{Tool models M4} \tag{8.37a}$$

For tool models M1, $X_2 = 1$, $X_3 = 0$, and $X_4 = 0$, and response function (8.37) becomes:

$$E\{Y\} = (\beta_0 + \beta_2) + \beta_1 X_1 \qquad \text{Tool models M1} \tag{8.37b}$$

Similarly, response function (8.37) becomes for tool models M2 and M3:

$$E\{Y\} = (\beta_0 + \beta_3) + \beta_1 X_1 \qquad \text{Tool models M2} \tag{8.37c}$$

$$E\{Y\} = (\beta_0 + \beta_4) + \beta_1 X_1 \qquad \text{Tool models M3} \tag{8.37d}$$


Thus, response function (8.37) implies that the regression of tool wear on tool speed is
linear, with the same slope for all four tool models. The coefficients $\beta_2$, $\beta_3$, and $\beta_4$ indicate,
respectively, how much higher (lower) the response functions for tool models M1, M2, and
M3 are than the one for tool models M4, for any given level of tool speed. Thus, $\beta_2$, $\beta_3$, and
$\beta_4$ measure the differential effects of the qualitative variable classes on the height of the
response function for any given level of $X_1$, always compared with the class for which $X_2 =
X_3 = X_4 = 0$. Figure 8.13 illustrates a possible arrangement of the response functions.

When using regression model (8.36), we may wish to estimate differential effects other
than against tool models M4. This can be done by estimating differences between regression
coefficients. For instance, $\beta_4 - \beta_3$ measures how much higher (lower) the response function
for tool models M3 is than the response function for tool models M2 for any given level of
tool speed, as may be seen by comparing (8.37c) and (8.37d). The point estimator of this
quantity is, of course, $b_4 - b_3$, and the estimated variance of this estimator is:

$$s^2\{b_4 - b_3\} = s^2\{b_4\} + s^2\{b_3\} - 2s\{b_4, b_3\} \tag{8.38}$$

The needed variances and covariance can be readily obtained from the estimated variance- 
covariance matrix of the regression coefficients. 
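
As a rough illustration of (8.38), the sketch below computes the estimated variance and standard deviation of a difference between two coefficient estimates. The function name and all numerical values are hypothetical; in practice the three inputs would be read from the estimated variance-covariance matrix of the fitted model.

```python
import math


def variance_of_difference(var_j, var_k, cov_jk):
    """Estimated variance of b_j - b_k, following the pattern of (8.38)."""
    return var_j + var_k - 2.0 * cov_jk


# Hypothetical entries of the estimated variance-covariance matrix.
var_b4 = 0.36    # s^2{b4}
var_b3 = 0.25    # s^2{b3}
cov_b4_b3 = 0.10  # s{b4, b3}

var_diff = variance_of_difference(var_b4, var_b3, cov_b4_b3)
sd_diff = math.sqrt(var_diff)
print(f"s^2{{b4 - b3}} = {var_diff:.4f}, s{{b4 - b3}} = {sd_diff:.4f}")
```

Note that a positive covariance between the two estimates reduces the variance of their difference.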

FIGURE 8.13 Illustration of Regression Model (8.36): Tool Wear Example. [Plot of Tool Wear against Tool Speed showing the response functions for the tool models, e.g., Tool Models M3: $E\{Y\} = (\beta_0 + \beta_4) + \beta_1 X_1$.]

Time Series Applications

Economists and business analysts frequently use time series data in regression analysis.
Indicator variables often are useful for time series regression models. For instance, savings
($Y$) may be regressed on income ($X$), where both the savings and income data are annual
data for a number of years. The model employed might be:

$$Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t \qquad t = 1, \ldots, n \tag{8.39}$$

where $Y_t$ and $X_t$ are savings and income, respectively, for time period $t$. Suppose that the
period covered includes both peacetime and wartime years, and that this factor should be
recognized since savings in wartime years tend to be higher. The following model might
then be appropriate:

$$Y_t = \beta_0 + \beta_1 X_{t1} + \beta_2 X_{t2} + \varepsilon_t \tag{8.40}$$

where:

$$X_{t1} = \text{income}$$

$$X_{t2} = \begin{cases} 1 & \text{if period } t \text{ peacetime} \\ 0 & \text{otherwise} \end{cases}$$

Note that regression model (8.40) assumes that the marginal propensity to save ($\beta_1$) is
constant in both peacetime and wartime years, and that only the height of the response
function is affected by this qualitative variable.

Another use of indicator variables in time series applications occurs when monthly
or quarterly data are used. Suppose that quarterly sales ($Y$) are regressed on quarterly
advertising expenditures ($X_1$) and quarterly disposable personal income ($X_2$). If seasonal
effects also have an influence on quarterly sales, a first-order regression model incorporating
seasonal effects would be:

$$Y_t = \beta_0 + \beta_1 X_{t1} + \beta_2 X_{t2} + \beta_3 X_{t3} + \beta_4 X_{t4} + \beta_5 X_{t5} + \varepsilon_t \tag{8.41}$$

where:

$$X_{t1} = \text{quarterly advertising expenditures}$$

$$X_{t2} = \text{quarterly disposable personal income}$$

$$X_{t3} = \begin{cases} 1 & \text{if first quarter} \\ 0 & \text{otherwise} \end{cases}$$

$$X_{t4} = \begin{cases} 1 & \text{if second quarter} \\ 0 & \text{otherwise} \end{cases}$$

$$X_{t5} = \begin{cases} 1 & \text{if third quarter} \\ 0 & \text{otherwise} \end{cases}$$
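
The data input for model (8.41) can be sketched as follows. The helper names `quarter_indicators` and `design_row`, and the sample figures, are assumptions made only for illustration; the fourth quarter serves as the reference class, as in the definitions above.

```python
def quarter_indicators(quarter):
    """Return [X_t3, X_t4, X_t5]; the fourth quarter is the reference class."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    return [int(quarter == q) for q in (1, 2, 3)]


def design_row(advertising, income, quarter):
    """One row of the X matrix for model (8.41), with a leading 1 for the intercept."""
    return [1.0, advertising, income] + quarter_indicators(quarter)


# Hypothetical observations: (advertising, income, quarter).
observations = [(12.0, 80.0, 1), (14.5, 82.0, 2), (11.0, 83.5, 3), (16.0, 85.0, 4)]
for adv, inc, q in observations:
    print(design_row(adv, inc, q))
```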

Regression models for time series data are susceptible to correlated error terms. It is 
particularly important in these cases to examine whether the modeling of the time series 
components of the data is adequate to make the error terms uncorrelated. We discuss in 
Chapter 12 a test for correlated error terms and a regression model that is often useful when 
the error terms are correlated. 



8.4 Some Considerations in Using Indicator Variables

Indicator Variables versus Allocated Codes 

An alternative to the use of indicator variables for a qualitative predictor variable is to
employ allocated codes. Consider, for instance, the predictor variable "frequency of product
use," which has three classes: frequent user, occasional user, nonuser. With the allocated
codes approach, a single X variable is employed and values are assigned to the classes; for
instance:

Class              $X_1$
Frequent user      3
Occasional user    2
Nonuser            1

The allocated codes are, of course, arbitrary and could be other sets of numbers. The first-order
model with allocated codes for our example, assuming no other predictor variables,
would be:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i \tag{8.42}$$

The basic difficulty with allocated codes is that they define a metric for the classes of the 
qualitative variable that may not be reasonable. To see this concretely, consider the mean 



responses with regression model (8.42) for the three classes of the qualitative variable:

Class              $E\{Y\}$
Frequent user      $E\{Y\} = \beta_0 + \beta_1(3) = \beta_0 + 3\beta_1$
Occasional user    $E\{Y\} = \beta_0 + \beta_1(2) = \beta_0 + 2\beta_1$
Nonuser            $E\{Y\} = \beta_0 + \beta_1(1) = \beta_0 + \beta_1$

Note the key implication:

$$E\{Y \mid \text{frequent user}\} - E\{Y \mid \text{occasional user}\} = E\{Y \mid \text{occasional user}\} - E\{Y \mid \text{nonuser}\} = \beta_1$$

Thus, the coding 1, 2, 3 implies that the mean response changes by the same amount when 
going from a nonuser to an occasional user as when going from an occasional user to a 
frequent user. This may not be in accord with reality and is the result of the coding 1, 2, 3,
which assigns equal distances between the three user classes. Other allocated codes may, of 
course, imply different spacings of the classes of the qualitative variable, but these would 
ordinarily still be arbitrary. 

Indicator variables, in contrast, make no assumptions about the spacing of the classes
and rely on the data to show the differential effects that occur. If, for the same example, two
indicator variables, say, $X_1$ and $X_2$, are employed to represent the qualitative variable, as
follows:

Class              $X_1$    $X_2$
Frequent user      1        0
Occasional user    0        1
Nonuser            0        0

the first-order regression model would be:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{8.43}$$

Here, $\beta_1$ measures the differential effect:

$$E\{Y \mid \text{frequent user}\} - E\{Y \mid \text{nonuser}\}$$

and $\beta_2$ measures the differential effect:

$$E\{Y \mid \text{occasional user}\} - E\{Y \mid \text{nonuser}\}$$

Thus, $\beta_2$ measures the differential effect between occasional user and nonuser, and $\beta_1 - \beta_2$
measures the differential effect between frequent user and occasional user. Notice that there
are no arbitrary restrictions to be satisfied by these two differential effects. Also note that
if $\beta_1 = 2\beta_2$, then equal spacing between the three classes would exist.
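
The contrast between the two codings can be seen in a small numerical sketch. The class means below are hypothetical, chosen so that the spacing between classes is unequal. With allocated codes 1, 2, 3 a straight line is fitted through the class means; with indicator variables and no other predictors, least squares reproduces each class mean, so the coefficients are simply differences of means.

```python
# Hypothetical mean responses for the three classes (unequal spacing).
means = {"nonuser": 10.0, "occasional": 12.0, "frequent": 20.0}
codes = {"nonuser": 1, "occasional": 2, "frequent": 3}

# Allocated codes: simple linear regression of the class means on the codes, as in (8.42).
x = [codes[c] for c in means]
y = [means[c] for c in means]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar
fitted_allocated = {c: intercept + slope * codes[c] for c in means}

# Indicator variables, as in (8.43): nonuser is the reference class.
b0 = means["nonuser"]
b1 = means["frequent"] - means["nonuser"]    # frequent user versus nonuser
b2 = means["occasional"] - means["nonuser"]  # occasional user versus nonuser

print("allocated codes:", fitted_allocated)   # equal steps between classes are forced
print("indicator coding:", b0, b1, b2)        # steps are free to differ
```

In this sketch the allocated-code fit spaces the three fitted means equally even though the underlying means are not equally spaced, whereas the indicator coefficients leave the two differential effects unrestricted.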

Indicator Variables versus Quantitative Variables 

Indicator variables can be used even if the predictor variable is quantitative. For instance, the 
quantitative variable age may be transformed by grouping ages into classes such as under 



21, 21-34, 35-49, etc. Indicator variables are then used for the classes of this new predictor
variable. At first sight, this may seem to be a questionable approach because information
about the actual ages is thrown away. Furthermore, additional parameters are placed into
the model, which leads to a reduction of the degrees of freedom associated with MSE.

Nevertheless, there are occasions when replacement of a quantitative variable by indicator
variables may be appropriate. Consider a large-scale survey in which the relation between
liquid assets ($Y$) and age ($X$) of head of household is to be studied. Two thousand households
were included in the study, so that the loss of 10 or 20 degrees of freedom is immaterial.
The analyst is very much in doubt about the shape of the regression function, which could
be highly complex, and hence may utilize the indicator variable approach in order to obtain
information about the shape of the response function without making any assumptions about
its functional form.

Thus, for large data sets use of indicator variables can serve as an alternative to lowess
and other nonparametric fits of the response function.

Other Codings for Indicator Variables 

As stated earlier, many different codings of indicator variables are possible. We now describe
two alternatives to our 0, 1 coding for $c - 1$ indicator variables for a qualitative variable
with $c$ classes. We illustrate these alternative codings for the insurance innovation example,
where $Y$ is time to adopt an innovation, $X_1$ is size of insurance firm, and the second predictor
variable is type of company (stock, mutual).

The first alternative coding is:

$$X_2 = \begin{cases} 1 & \text{if stock company} \\ -1 & \text{if mutual company} \end{cases} \tag{8.44}$$

For this coding, the first-order linear regression model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{8.45}$$

has the response function:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{8.46}$$

This response function becomes for the two types of companies:

$$E\{Y\} = (\beta_0 + \beta_2) + \beta_1 X_1 \qquad \text{Stock firms} \tag{8.46a}$$

$$E\{Y\} = (\beta_0 - \beta_2) + \beta_1 X_1 \qquad \text{Mutual firms} \tag{8.46b}$$

Thus, $\beta_0$ here may be viewed as an "average" intercept of the regression line, from which
the stock company and mutual company intercepts differ by $\beta_2$ in opposite directions. A test
whether the regression lines are the same for both types of companies involves $H_0\colon \beta_2 = 0$,
$H_a\colon \beta_2 \ne 0$.

A second alternative coding scheme is to use a 0, 1 indicator variable for each of the $c$
classes of the qualitative variable and to drop the intercept term in the regression model.
For the insurance innovation example, the model would be:

$$Y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \tag{8.47}$$

where:

$$X_{i1} = \text{size of firm}$$

$$X_{i2} = \begin{cases} 1 & \text{if stock company} \\ 0 & \text{otherwise} \end{cases}$$

$$X_{i3} = \begin{cases} 1 & \text{if mutual company} \\ 0 & \text{otherwise} \end{cases}$$

Here, the two response functions are:

$$E\{Y\} = \beta_2 + \beta_1 X_1 \qquad \text{Stock firms} \tag{8.48a}$$

$$E\{Y\} = \beta_3 + \beta_1 X_1 \qquad \text{Mutual firms} \tag{8.48b}$$


A test of whether or not the two regression lines are the same would involve the alternatives
$H_0\colon \beta_2 = \beta_3$, $H_a\colon \beta_2 \ne \beta_3$. This type of test, discussed in Section 7.3, cannot be conducted
by using extra sums of squares and requires the fitting of both the full and reduced models.
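
Because the three codings describe the same pair of parallel lines, estimates under one coding can be translated into the others. The sketch below starts from the 0, 1 coding estimates reported for the insurance innovation example and derives what the intercept terms would be under codings (8.44) and (8.47); the variable names are illustrative, and the translation assumes the same data and the same common slope.

```python
# Estimates under the 0, 1 coding (stock = 1, mutual = 0) from the fitted model (8.33).
b0_01 = 33.87407   # intercept for mutual firms
b2_01 = 8.05547    # stock minus mutual

stock_intercept = b0_01 + b2_01
mutual_intercept = b0_01

# Coding (8.44): stock = 1, mutual = -1, so the intercepts are b0 + b2 and b0 - b2.
b0_pm = (stock_intercept + mutual_intercept) / 2.0   # "average" intercept
b2_pm = (stock_intercept - mutual_intercept) / 2.0   # half the differential effect

# Coding (8.47): one indicator per class and no intercept term.
b2_noint = stock_intercept    # coefficient of the stock indicator
b3_noint = mutual_intercept   # coefficient of the mutual indicator

print(f"(8.44) coding: b0 = {b0_pm:.5f}, b2 = {b2_pm:.5f}")
print(f"(8.47) coding: b2 = {b2_noint:.5f}, b3 = {b3_noint:.5f}")
```

The slope coefficient for size of firm is unchanged by the choice of coding.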


8.5 Modeling Interactions between Quantitative
and Qualitative Predictors

In the insurance innovation example, the economist actually did not begin the analysis with
regression model (8.33) because of the possibility of interaction effects between size of
firm and type of firm on the response variable. Even though one of the predictor variables
in the regression model here is qualitative, interaction effects can still be introduced into
the model in the usual manner, by including cross-product terms. A first-order regression
model with an added interaction term for the insurance innovation example is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + \varepsilon_i \tag{8.49}$$

where:

$$X_{i1} = \text{size of firm}$$

$$X_{i2} = \begin{cases} 1 & \text{if stock company} \\ 0 & \text{otherwise} \end{cases}$$

The response function for this regression model is:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \tag{8.50}$$

Meaning of Regression Coefficients

The meaning of the regression coefficients in response function (8.50) can best be understood
by examining the nature of this function for each type of firm. For a mutual firm, $X_2 = 0$
and hence $X_1 X_2 = 0$. Response function (8.50) therefore becomes for mutual firms:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2(0) + \beta_3(0) = \beta_0 + \beta_1 X_1 \qquad \text{Mutual firms} \tag{8.50a}$$

This response function is shown in Figure 8.14. Note that the $Y$ intercept is $\beta_0$ and the slope
is $\beta_1$ for the response function for mutual firms.



FIGURE 8.14 Illustration of Meaning of Regression Coefficients for Regression Model (8.49) with Indicator Variable $X_2$ and Interaction Term: Insurance Innovation Example. [Plot of Number of Months Elapsed against Size of Firm.]

For stock firms, $X_2 = 1$ and hence $X_1 X_2 = X_1$. Response function (8.50) therefore
becomes for stock firms:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2(1) + \beta_3 X_1$$

or:

$$E\{Y\} = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)X_1 \qquad \text{Stock firms} \tag{8.50b}$$

This response function is also shown in Figure 8.14. Note that the response function for
stock firms has $Y$ intercept $\beta_0 + \beta_2$ and slope $\beta_1 + \beta_3$.

We see that $\beta_2$ here indicates how much greater (smaller) is the $Y$ intercept of the response
function for the class coded 1 than that for the class coded 0. Similarly, $\beta_3$ indicates how
much greater (smaller) is the slope of the response function for the class coded 1 than that
for the class coded 0. Because both the intercept and the slope differ for the two classes in
regression model (8.49), it is no longer true that $\beta_2$ indicates how much higher (lower) one
response function is than the other for any given level of $X_1$. Figure 8.14 shows that the
effect of type of firm with regression model (8.49) depends on $X_1$, the size of the firm. For
smaller firms, according to Figure 8.14, mutual firms tend to innovate more quickly, but for
larger firms stock firms tend to innovate more quickly. Thus, when interaction effects are
present, the effect of the qualitative predictor variable can be studied only by comparing the
regression functions within the scope of the model for the different classes of the qualitative
variable.
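
Response functions (8.50a) and (8.50b) differ by $\beta_2 + \beta_3 X_1$, so when $\beta_3 \ne 0$ they cross where $X_1 = -\beta_2 / \beta_3$. The following sketch locates that crossing point and checks whether it falls inside a given range of $X_1$; the function names, coefficient values, and range are hypothetical and serve only to illustrate the calculation.

```python
def crossing_point(beta2, beta3):
    """X1 value at which the two response functions intersect, or None if they are parallel."""
    if beta3 == 0:
        return None
    return -beta2 / beta3


def crosses_within_scope(beta2, beta3, x_low, x_high):
    """True if the response functions intersect for some X1 in [x_low, x_high]."""
    x_cross = crossing_point(beta2, beta3)
    return x_cross is not None and x_low <= x_cross <= x_high


# Hypothetical coefficients: class coded 1 starts 6 units lower but gains .04 per unit of X1.
beta2, beta3 = -6.0, 0.04
print(crossing_point(beta2, beta3))                     # near X1 = 150
print(crosses_within_scope(beta2, beta3, 30.0, 300.0))  # crossing lies inside this range
print(crosses_within_scope(beta2, beta3, 200.0, 300.0)) # crossing lies outside this range
```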

Figure 8.15 illustrates another possible interaction pattern for the insurance innovation 
example. Here, mutual firms tend to introduce the innovation more quickly than stock firms 



for all sizes of firms in the scope of the model, but the differential effect is much smaller
for large firms than for small ones.

The interactions portrayed in Figures 8.14 and 8.15 can no longer be viewed as reinforcing
or interfering types of interactions because one of the predictor variables here is qualitative.
When one of the predictor variables is qualitative and the other quantitative, nonparallel
response functions that do not intersect within the scope of the model (as in Figure 8.15) are
sometimes said to represent an ordinal interaction. When the response functions intersect
within the scope of the model (as in Figure 8.14), the interaction is then said to be a disordinal
interaction.

Example

Since the economist was concerned that interaction effects between size and type of firm
may be present, the initial regression model fitted was model (8.49):

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + \varepsilon_i$$

The values for the interaction term $X_1 X_2$ for the insurance innovation example are shown
in Table 8.2, column 5, on page 317. Note that this column contains 0 for mutual companies
and $X_{i1}$ for stock companies.

Again, the regression fit is routine. Basic results from a computer run regressing $Y$ on
$X_1$, $X_2$, and $X_1 X_2$ are shown in Table 8.4. To test for the presence of interaction effects:

$$H_0\colon \beta_3 = 0$$
$$H_a\colon \beta_3 \ne 0$$

the economist used the $t^*$ statistic from Table 8.4a:

$$t^* = \frac{b_3}{s\{b_3\}} = \frac{-.0004171}{.01833} = -.02$$



TABLE 8.4 Regression Results for Fit of Regression Model (8.49): Insurance Innovation Example.

(a) Regression Coefficients

Regression     Estimated                  Estimated
Coefficient    Regression Coefficient     Standard Deviation     $t^*$
$\beta_0$      33.83837                   2.44065                13.86
$\beta_1$      -.10153                    .01305                 -7.78
$\beta_2$      8.13125                    3.65405                2.23
$\beta_3$      -.0004171                  .01833                 -.02

(b) Analysis of Variance

Source of Variation     SS          df     MS
Regression              1,504.42     3     501.47
Error                     176.38    16      11.02
Total                   1,680.80    19


For level of significance .05, we require $t(.975; 16) = 2.120$. Since $|t^*| = .02 \le 2.120$,
we conclude $H_0$, that $\beta_3 = 0$. The conclusion of no interaction effects is supported by the
two-sided P-value for the test, which is very high, namely, .98. It was because of this result
that the economist adopted regression model (8.33) with no interaction term, which we
discussed earlier.
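
The decision rule for this test amounts to a few lines of arithmetic. The sketch below uses the estimates from Table 8.4 and the tabled critical value quoted above; the variable names are illustrative.

```python
# Interaction coefficient and its estimated standard deviation from Table 8.4a.
b3 = -0.0004171
s_b3 = 0.01833
t_crit = 2.120  # tabled t(.975; 16)

t_star = b3 / s_b3
conclude_h0 = abs(t_star) <= t_crit
print(f"t* = {t_star:.2f}; conclude H0 (no interaction): {conclude_h0}")
```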

Comment 

Fitting regression model (8.49) yields the same response functions as would fitting separate regressions
for stock firms and mutual firms. An advantage of using model (8.49) with an indicator variable is
that one regression run will yield both fitted regressions.

Another advantage is that tests for comparing the regression functions for the different classes of
the qualitative variable can be clearly seen to involve tests of regression coefficients in a general linear
model. For instance, Figure 8.14 for the insurance innovation example shows that a test of whether
the two regression functions have the same slope involves:

$$H_0\colon \beta_3 = 0$$
$$H_a\colon \beta_3 \ne 0$$

Similarly, Figure 8.14 shows that a test of whether the two regression functions are identical involves:

$$H_0\colon \beta_2 = \beta_3 = 0$$
$$H_a\colon \text{not both } \beta_2 = 0 \text{ and } \beta_3 = 0$$

■ 

8.6 More Complex Models

We now briefly consider more complex models involving quantitative and qualitative
predictor variables.


More than One Qualitative Predictor Variable 

Regression models can readily be constructed for cases where two or more of the predictor
variables are qualitative. Consider the regression of advertising expenditures ($Y$) on sales
($X_1$), type of firm (incorporated, not incorporated), and quality of sales management (high,
low). We may define:

$$X_2 = \begin{cases} 1 & \text{if firm incorporated} \\ 0 & \text{otherwise} \end{cases}$$

$$X_3 = \begin{cases} 1 & \text{if quality of sales management high} \\ 0 & \text{otherwise} \end{cases} \tag{8.51}$$

First-Order Model. A first-order regression model for the above example is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \tag{8.52}$$

This model implies that the response function of advertising expenditures on sales is linear,
with the same slope for all "type of firm-quality of sales management" combinations,
and $\beta_2$ and $\beta_3$ indicate the additive differential effects of type of firm and quality of sales
management on the height of the regression line for any given levels of $X_1$ and the other
predictor variable.

First-Order Model with Certain Interactions Added. A first-order regression model
to which are added interaction effects between each pair of the predictor variables for the
advertising example is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i1}X_{i2} + \beta_5 X_{i1}X_{i3} + \beta_6 X_{i2}X_{i3} + \varepsilon_i \tag{8.53}$$


Note the implications of this model:

Type of Firm        Quality of Sales Management    Response Function
Incorporated        High                           $E\{Y\} = (\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5)X_1$
Not incorporated    High                           $E\{Y\} = (\beta_0 + \beta_3) + (\beta_1 + \beta_5)X_1$
Incorporated        Low                            $E\{Y\} = (\beta_0 + \beta_2) + (\beta_1 + \beta_4)X_1$
Not incorporated    Low                            $E\{Y\} = \beta_0 + \beta_1 X_1$

Not only are all response functions different for the various "type of firm-quality of sales
management" combinations, but the differential effects of one qualitative variable on the
intercept depend on the particular class of the other qualitative variable. For instance, when
we move from "not incorporated-low quality" to "incorporated-low quality," the intercept
changes by $\beta_2$. But if we move from "not incorporated-high quality" to "incorporated-high
quality," the intercept changes by $\beta_2 + \beta_6$.
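
The four response functions implied by model (8.53) can be tabulated with a small helper. This is a sketch only: the function name `response_line` and the coefficient values are hypothetical, chosen so that each term is easy to trace.

```python
def response_line(beta, x2, x3):
    """Intercept and slope (in X1) of the response function of model (8.53).

    beta = [b0, b1, b2, b3, b4, b5, b6]; x2 and x3 are the 0/1 indicators of (8.51).
    """
    b0, b1, b2, b3, b4, b5, b6 = beta
    intercept = b0 + b2 * x2 + b3 * x3 + b6 * x2 * x3
    slope = b1 + b4 * x2 + b5 * x3
    return intercept, slope


beta = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # hypothetical values of beta_0, ..., beta_6

for x2, firm in ((1, "Incorporated"), (0, "Not incorporated")):
    for x3, quality in ((1, "High"), (0, "Low")):
        print(firm, quality, response_line(beta, x2, x3))

# Effect of incorporation on the intercept at each quality level.
change_low = response_line(beta, 1, 0)[0] - response_line(beta, 0, 0)[0]    # beta_2
change_high = response_line(beta, 1, 1)[0] - response_line(beta, 0, 1)[0]   # beta_2 + beta_6
print(change_low, change_high)
```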






Qualitative Predictor Variables Only 

Regression models containing only qualitative predictor variables can also be constructed.
With reference to our advertising example, we could regress advertising expenditures only
on type of firm and quality of sales management. The first-order regression model then
would be:

$$Y_i = \beta_0 + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \tag{8.54}$$

where $X_{i2}$ and $X_{i3}$ are defined in (8.51).


Comments 

1. Models in which all explanatory variables are qualitative are called analysis of variance 
models. 


2. Models containing some quantitative and some qualitative explanatory variables, where the 
chief explanatory variables of interest are qualitative and the quantitative variables are introduced 
primarily to reduce the variance of the error terms, are called analysis of covariance models. 


■ 


8.7 Comparison of Two or More Regression Functions

Frequently we encounter regressions for two or more populations and wish to study their 
similarities and differences. We present three examples. 

1. A company operates two production lines for making soap bars. For each line, the
relation between the speed of the line and the amount of scrap for the day was studied.
A scatter plot of the data for the two production lines suggests that the regression relation
between production line speed and amount of scrap is linear but not the same for the two
production lines. The slopes appear to be about the same, but the heights of the regression
lines seem to differ. A formal test is desired to determine whether or not the two regression
lines are identical. If it is found that the two regression lines are not the same, an
investigation is to be made of why the difference in scrap yield exists.

2. An economist is studying the relation between amount of savings and level of income
for middle-income families from urban and rural areas, based on independent samples from
the two populations. Each of the two relations can be modeled by linear regression. The
economist wishes to compare whether, at given income levels, urban and rural families
tend to save the same amount, i.e., whether the two regression lines are the same. If they
are not, the economist wishes to explore whether at least the amounts of savings out of an
additional dollar of income are the same for the two groups, i.e., whether the slopes of the
two regression lines are the same.

3. Two instruments were constructed for a company to identical specifications to measure
pressure in an industrial process. A study was then made for each instrument of the relation
between its gauge readings and actual pressures as determined by an almost exact but slow
and costly method. If the two regression lines are the same, a single calibration schedule
can be developed for the two instruments; otherwise, two different calibration schedules
will be required.





TABLE 8.5 Data: Soap Production Lines Example (all data are coded).

Production Line 1

Case    Amount of Scrap    Line Speed
$i$     $Y_i$              $X_{i1}$      $X_{i2}$
 1      218                100           1
 2      248                125           1
 3      360                220           1
 4      351                205           1
 5      470                300           1
 6      394                255           1
 7      332                225           1
 8      321                175           1
 9      410                270           1
10      260                170           1
11      241                155           1
12      331                190           1
13      275                140           1
14      425                290           1
15      367                265           1

Production Line 2

Case    Amount of Scrap    Line Speed
$i$     $Y_i$              $X_{i1}$      $X_{i2}$
16      140                105           0
17      277                215           0
18      384                270           0
19      341                255           0
20      215                175           0
21      180                135           0
22      260                200           0
23      361                275           0
24      252                155           0
25      422                320           0
26      273                190           0
27      410                295           0


When it is reasonable to assume that the error term variances in the regression models
for the different populations are equal, we can use indicator variables to test the equality
of the different regression functions. If the error variances are not equal, transformations of
the response variable may equalize them at least approximately.

We have already seen how regression models with indicator variables that contain
interaction terms permit testing of the equality of regression functions for the different classes
of a qualitative variable. This methodology can be used directly for testing the equality of
regression functions for different populations. We simply consider the different populations
as classes of a predictor variable, define indicator variables for the different populations, and
develop a regression model containing appropriate interaction terms. Since no new principles
arise in the testing of the equality of regression functions for different populations, we
immediately proceed with two of the earlier examples to illustrate the approach.

Soap Production Lines Example

The data on amount of scrap ($Y$) and line speed ($X_1$) for the soap production lines example
are presented in Table 8.5. The variable $X_2$ is a code for the production line. A symbolic
scatter plot of the data, using different symbols for the two production lines, is shown in
Figure 8.16.

Tentative Model. On the basis of the symbolic scatter plot in Figure 8.16, the analyst
decided to tentatively fit regression model (8.49). This model assumes that the regression
relation between amount of scrap and line speed is linear for both production lines and that
the variances of the error terms are the same, but permits the two regression lines to have
different slopes and intercepts:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + \varepsilon_i \tag{8.55}$$



where:

$$X_{i1} = \text{line speed}$$

$$X_{i2} = \begin{cases} 1 & \text{if production line 1} \\ 0 & \text{if production line 2} \end{cases}$$

$$i = 1, 2, \ldots, 27$$

FIGURE 8.16 Symbolic Scatter Plot: Soap Production Lines Example. [Plot of Amount of Scrap (coded) against Line Speed (coded), with different symbols for the two production lines.]


Note that for purposes of this model, the 15 cases for production line 1 and the 12 cases for 
production line 2 are combined into one group of 27 cases. 

Diagnostics. A fit of regression model (8.55) to the data in Table 8.5 led to the results
presented in Table 8.6 and the following fitted regression function:

$$\hat{Y} = 7.57 + 1.322X_1 + 90.39X_2 - .1767X_1X_2$$

Plots of the residuals against $\hat{Y}$ are shown in Figure 8.17 for each production line. Two plots
are used in order to facilitate the diagnosis of possible differences between the two production
lines. Both plots in Figure 8.17 are reasonably consistent with regression model (8.55).
The splits between positive and negative residuals of 10 to 5 for production line 1 and 4 to
8 for production line 2 can be accounted for by randomness of the outcomes. Plots of the
residuals against $X_2$ and a normal probability plot of the residuals (not shown) also support
the appropriateness of the fitted model. For the latter plot, the coefficient of correlation
between the ordered residuals and their expected values under normality is .990. This is
sufficiently high according to Table B.6 to support the assumption of normality of the error
terms.

TABLE 8.6 Regression Results for Fit of Regression Model (8.55): Soap Production Lines Example.

(a) Regression Coefficients

Regression     Estimated                  Estimated
Coefficient    Regression Coefficient     Standard Deviation
$\beta_0$      7.57                       20.87
$\beta_1$      1.322                      .09262
$\beta_2$      90.39                      28.35
$\beta_3$      -.1767                     .1288

(b) Analysis of Variance

Source of Variation          SS         df
Regression                   169,165     3
  $X_1$                      149,661     1
  $X_2 \mid X_1$              18,694     1
  $X_1X_2 \mid X_1, X_2$         810     1
Error                          9,904    23
Total                        179,069    26

FIGURE 8.17 Residual Plots against $\hat{Y}$: Soap Production Lines Example. [Residuals plotted against fitted values in a separate panel for each production line, beginning with (a) Production Line 1.]

Finally, the analyst desired to make a formal test of the equality of the variances of
the error terms for the two production lines, using the Brown-Forsythe test described in
Section 3.6. Separate linear regression models were fitted to the data for the two production
lines, the residuals were obtained, and the absolute deviations $d_{i1}$ and $d_{i2}$ in (3.8) of the
residuals around the median residual for each production line were computed.
The results were as follows:

Production Line 1                                 Production Line 2
$\hat{Y} = 97.965 + 1.145X_1$                     $\hat{Y} = 7.57 + 1.322X_1$
$\bar{d}_1 = 16.132$                              $\bar{d}_2 = 12.648$
$\sum (d_{i1} - \bar{d}_1)^2 = 2{,}952.20$        $\sum (d_{i2} - \bar{d}_2)^2 = 2{,}045.82$

Chapter 8 Regression Models for Quantitative and Qualitative Predictors 333 


The pooled variance $s^2$ in (3.9a) therefore is:

$$s^2 = \frac{2{,}952.20 + 2{,}045.82}{27 - 2} = 199.921$$

Hence, the pooled standard deviation is $s = 14.139$, and the test statistic in (3.9) is:

$$t^*_{BF} = \frac{16.132 - 12.648}{14.139\sqrt{\dfrac{1}{15} + \dfrac{1}{12}}} = .636$$

For $\alpha = .05$, we require $t(.975; 25) = 2.060$. Since $|t^*_{BF}| = .636 < 2.060$, we conclude
that the error term variances for the two production lines do not differ. The two-sided
P-value for this test is .53.
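
The arithmetic of the Brown-Forsythe statistic can be checked with a short sketch. The function name `brown_forsythe_t` is hypothetical, and the group sizes $n_1 = 15$ and $n_2 = 12$ are an assumption: they sum to the 27 cases used in the pooled variance and reproduce the reported statistic. All other numbers are the ones quoted in the text.

```python
import math

# Quantities quoted in the text for the soap production lines example.
d_bar_1, d_bar_2 = 16.132, 12.648        # mean absolute deviations per line
ss_dev_1, ss_dev_2 = 2952.20, 2045.82    # sums of squared deviations of the d values
n1, n2 = 15, 12                          # ASSUMED group sizes (n1 + n2 = 27)


def brown_forsythe_t(d1, d2, ss1, ss2, n1, n2):
    """Return (pooled variance, t*_BF) for the two-group Brown-Forsythe test."""
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var) * math.sqrt(1.0 / n1 + 1.0 / n2)
    return pooled_var, (d1 - d2) / std_error


pooled_var, t_bf = brown_forsythe_t(d_bar_1, d_bar_2, ss_dev_1, ss_dev_2, n1, n2)
T_CRITICAL = 2.060                       # t(.975; 25), as quoted in the text
variances_differ = abs(t_bf) > T_CRITICAL

print(round(pooled_var, 3), round(t_bf, 3), variances_differ)
```

The decision rule compares $|t^*_{BF}|$ with the two-sided critical value, so the sign of the difference between the two mean absolute deviations does not matter.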

At this point, the analyst was satisfied about the aptness of regression model (8.55)
with normal error terms and was ready to proceed with comparing the regression relation
between amount of scrap and line speed for the two production lines.

Inferences about Two Regression Lines. Identity of the regression functions for the two
production lines is tested by considering the alternatives:

$$H_0\colon \beta_2 = \beta_3 = 0 \qquad H_a\colon \text{not both } \beta_2 = 0 \text{ and } \beta_3 = 0 \tag{8.56}$$

The appropriate test statistic is given by (7.27):

$$F^* = \frac{SSR(X_2, X_1X_2 \mid X_1)}{2} \div \frac{SSE(X_1, X_2, X_1X_2)}{n - 4} \tag{8.56a}$$



where n represents the combined sample size for both populations. Using the regression
results in Table 8.6, we find:

$$SSR(X_2, X_1X_2 \mid X_1) = SSR(X_2 \mid X_1) + SSR(X_1X_2 \mid X_1, X_2) = 18{,}694 + 810 = 19{,}504$$

$$F^* = \frac{19{,}504}{2} \div \frac{9{,}904}{23} = 22.65$$

To control $\alpha$ at level .01, we require $F(.99; 2, 23) = 5.67$. Since $F^* = 22.65 > 5.67$, we
conclude $H_a$, that the regression functions for the two production lines are not identical.

Next, the analyst examined whether the slopes of the regression lines are the same. The
alternatives here are:

$$H_0\colon \beta_3 = 0 \qquad H_a\colon \beta_3 \ne 0 \tag{8.57}$$

and the appropriate test statistic is either the $t^*$ statistic (7.25) or the partial F test
statistic (7.24):

$$F^* = \frac{SSR(X_1X_2 \mid X_1, X_2)}{1} \div \frac{SSE(X_1, X_2, X_1X_2)}{n - 4} \tag{8.57a}$$

Using the regression results in Table 8.6 and the partial F test statistic, we obtain:

$$F^* = \frac{810}{1} \div \frac{9{,}904}{23} = 1.88$$




For $\alpha = .01$, we require $F(.99; 1, 23) = 7.88$. Since $F^* = 1.88 < 7.88$, we conclude
that the slopes of the regression functions for the two production lines are the same.
Using the Bonferroni inequality (4.2), the analyst can therefore conclude at family
significance level .02 that a given increase in line speed leads to the same amount of increase
in expected scrap in each of the two production lines, but that the expected amount of scrap
for any given line speed differs by a constant amount for the two production lines.
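
Both F tests for the two production lines use the same arithmetic, so a small sketch may help in checking the figures. The function name `partial_f` is hypothetical; the sums of squares, degrees of freedom, and critical values are the ones quoted from Table 8.6 and the text.

```python
def partial_f(ssr_extra, df_extra, sse_full, df_error):
    """Partial F statistic: (extra SSR / its df) divided by the full-model MSE."""
    return (ssr_extra / df_extra) / (sse_full / df_error)


SSE_FULL, DF_ERROR = 9904.0, 23          # SSE(X1, X2, X1X2) and n - 4

# Test of identical regression functions: beta2 = beta3 = 0.
f_identity = partial_f(18694.0 + 810.0, 2, SSE_FULL, DF_ERROR)

# Test of equal slopes: beta3 = 0.
f_slopes = partial_f(810.0, 1, SSE_FULL, DF_ERROR)

reject_identity = f_identity > 5.67      # F(.99; 2, 23), as quoted in the text
reject_slopes = f_slopes > 7.88          # F(.99; 1, 23), as quoted in the text

# Bonferroni bound on the family level for two tests, each run at alpha = .01.
family_alpha = 2 * 0.01

print(round(f_identity, 2), reject_identity)
print(round(f_slopes, 2), reject_slopes)
print(family_alpha)
```

Rejecting the first hypothesis while retaining the second is what supports the conclusion of parallel lines with different intercepts.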

We can estimate this constant difference in the regression lines by obtaining a confidence
interval for $\beta_2$. For a 95 percent confidence interval, we require $t(.975; 23) = 2.069$. Using
the results in Table 8.6, we obtain the confidence limits 90.39 ± 2.069(28.35). Hence, the
confidence interval for $\beta_2$ is:

$$31.7 \le \beta_2 \le 149.0$$

We thus conclude, with 95 percent confidence, that the mean amount of scrap for production
line 1, at any given line speed, exceeds that for production line 2 by somewhere between
32 and 149.
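
The confidence limits follow directly from the point estimate, its estimated standard deviation, and the t percentile. A minimal sketch, using only the numbers quoted above (the variable names are ours):

```python
b2 = 90.39          # estimated coefficient for the production line indicator
s_b2 = 28.35        # estimated standard deviation of b2
t_crit = 2.069      # t(.975; 23), as quoted in the text

half_width = t_crit * s_b2
lower, upper = b2 - half_width, b2 + half_width

print(round(lower, 1), round(upper, 1))
```

Because the interval excludes zero, it agrees with the earlier finding that the two regression functions are not identical.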

Instrument Calibration Study Example

The engineer making the calibration study believed that the regression functions relating
gauge reading (Y) to actual pressure ($X_1$) for both instruments are second-order
polynomials:

$$E\{Y\} = \beta_0 + \beta_1X_1 + \beta_2X_1^2$$

but that they might differ for the two instruments. Hence, the model employed (using a
centered variable for $X_1$ to reduce multicollinearity problems; see Section 8.1) was:

$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i1}^2 + \beta_3X_{i2} + \beta_4x_{i1}X_{i2} + \beta_5x_{i1}^2X_{i2} + \varepsilon_i \tag{8.58}$$

where:

$$x_{i1} = X_{i1} - \bar{X}_1 = \text{centered actual pressure}$$

$$X_{i2} = \begin{cases} 1 & \text{if instrument B} \\ 0 & \text{otherwise} \end{cases}$$

Note that for instrument A, where $X_2 = 0$, the response function is:

$$E\{Y\} = \beta_0 + \beta_1x_1 + \beta_2x_1^2 \qquad \text{Instrument A} \tag{8.59a}$$

and for instrument B, where $X_2 = 1$, the response function is:

$$E\{Y\} = (\beta_0 + \beta_3) + (\beta_1 + \beta_4)x_1 + (\beta_2 + \beta_5)x_1^2 \qquad \text{Instrument B} \tag{8.59b}$$

Hence, the test for equality of the two response functions involves the alternatives:

$$H_0\colon \beta_3 = \beta_4 = \beta_5 = 0$$

$$H_a\colon \text{not all } \beta_k \text{ in } H_0 \text{ equal zero}$$

and the appropriate test statistic is (7.27):

$$F^* = \frac{SSR(X_2, x_1X_2, x_1^2X_2 \mid x_1, x_1^2)}{3} \div \frac{SSE(x_1, x_1^2, X_2, x_1X_2, x_1^2X_2)}{n - 6} \tag{8.60}$$

where n represents the combined sample size for both populations.
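
To make the structure of the model concrete, the sketch below builds one row of the X matrix for a single observation and shows how the indicator variable switches between the two response functions. The function names, the pressure readings, and the coefficient values are all hypothetical illustrations, not data from the study.

```python
def design_row(pressure, mean_pressure, is_instrument_b):
    """Row of the X matrix: [1, x, x^2, X2, x*X2, x^2*X2] with x centered."""
    x = pressure - mean_pressure
    x2 = 1.0 if is_instrument_b else 0.0
    return [1.0, x, x * x, x2, x * x2, x * x * x2]


def response_coefficients(beta, is_instrument_b):
    """Intercept, linear, and quadratic coefficients of the response function."""
    b0, b1, b2, b3, b4, b5 = beta
    if is_instrument_b:
        return (b0 + b3, b1 + b4, b2 + b5)
    return (b0, b1, b2)


# Hypothetical pressures and coefficients, for illustration only.
pressures = [10.0, 20.0, 30.0]
mean_pressure = sum(pressures) / len(pressures)
beta = (50.0, 2.0, 0.01, 1.5, -0.1, 0.002)

row_a = design_row(30.0, mean_pressure, is_instrument_b=False)
row_b = design_row(30.0, mean_pressure, is_instrument_b=True)

# Expected response computed from the full model row for instrument B ...
mean_b_from_row = sum(b * v for b, v in zip(beta, row_b))
# ... equals the instrument B polynomial with summed coefficients.
c0, c1, c2 = response_coefficients(beta, is_instrument_b=True)
x = 30.0 - mean_pressure
mean_b_from_poly = c0 + c1 * x + c2 * x * x

print(row_a)
print(row_b)
print(mean_b_from_row, mean_b_from_poly)
```

For instrument A the last three entries of the row are zero, so $\beta_3$, $\beta_4$, and $\beta_5$ drop out. This is why testing those three coefficients jointly is a test of identical response functions.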





Comments

1. The approach just described is completely general. If three or more populations are involved,
additional indicator variables are simply added to the model.

2. The use of indicator variables for testing whether two or more regression functions are the
same is equivalent to the general linear test approach where fitting the full model involves fitting
separate regressions to the data from each population, and fitting the reduced model involves fitting
one regression to the combined data.
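
The equivalence stated in the second comment can be checked numerically. The sketch below is a minimal pure-Python illustration on made-up data for two populations. It fits the indicator-variable model with an interaction term, fits separate lines to each population, and compares the error sums of squares. The helper names `solve` and `sse` are hypothetical, and the normal equations are solved by simple Gauss-Jordan elimination, which is adequate only for small, well-conditioned problems like this one.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sse(rows, y):
    """Least squares fit via the normal equations; returns the error sum of squares."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    b = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))


# Hypothetical data for two populations (illustration only).
x_a, y_a = [1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1]
x_b, y_b = [1.0, 2.0, 3.0, 4.0, 5.0], [5.2, 5.9, 7.1, 8.0, 8.8]

x_all, y_all = x_a + x_b, y_a + y_b
group = [0.0] * len(x_a) + [1.0] * len(x_b)

# Full model: intercept, X1, indicator X2, and interaction X1*X2.
sse_full = sse([[1.0, x, g, x * g] for x, g in zip(x_all, group)], y_all)
# Separate regressions fitted to each population.
sse_separate = sse([[1.0, x] for x in x_a], y_a) + sse([[1.0, x] for x in x_b], y_b)
# Reduced model: one regression fitted to the combined data.
sse_reduced = sse([[1.0, x] for x in x_all], y_all)

n = len(y_all)
f_star = ((sse_reduced - sse_full) / 2) / (sse_full / (n - 4))
print(sse_full, sse_separate, f_star)
```

The two error sums of squares agree up to rounding, so the general linear test statistic is the same whichever way the full model is fitted.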



8.1. Draper, N. R., and H. Smith. Applied Regression Analysis. 3rd ed. New York: John Wiley &
Sons, 1998.


Problems 


8.1. Prepare a contour plot for the quadratic response surface £{7} = 140 + 4x^ — 5x^2- 

Describe the shape of the response surface.

8.2. Prepare a contour plot for the quadratic response surface £{7} = 124 — 3xj — — 6xix 2 . 

Describe the shape of the response surface. 

8.3. A junior investment analyst used a polynomial regression model of relatively high order in a
research seminar on municipal bonds and obtained an $R^2$ of .991 in the regression of net interest
yield of bond (Y) on industrial diversity index of municipality (X) for seven bond issues. A
classmate, unimpressed, said: "You overfitted. Your curve follows the random effects in the
data."

a. Comment on the criticism.

b. Might $R_a^2$ defined in (6.42) be more appropriate than $R^2$ as a descriptive measure here?

*8.4. Refer to Muscle mass Problem 1.27. Second-order regression model (8.2) with independent
normal error terms is expected to be appropriate.

a. Fit regression model (8.2). Plot the fitted regression function and the data. Does the quadratic
regression function appear to be a good fit here? Find $R^2$.

b. Test whether or not there is a regression relation; use $\alpha = .05$. State the alternatives, decision
rule, and conclusion.

c. Estimate the mean muscle mass for women aged 48 years; use a 95 percent confidence
interval. Interpret your interval.

d. Predict the muscle mass for a woman whose age is 48 years; use a 95 percent prediction
interval. Interpret your interval.

e. Test whether the quadratic term can be dropped from the regression model; use $\alpha = .05$.
State the alternatives, decision rule, and conclusion.

f. Express the fitted regression function obtained in part (a) in terms of the original variable X.

g. Calculate the coefficient of simple correlation between X and $X^2$ and between x and $x^2$. Is
the use of a centered variable helpful here?

*8.5. Refer to Muscle mass Problems 1.27 and 8.4.

a. Obtain the residuals from the fit in 8.4a and plot them against $\hat{Y}$ and against x on separate
graphs. Also prepare a normal probability plot. Interpret your plots.

b. Test formally for lack of fit of the quadratic regression function; use $\alpha = .05$. State the
alternatives, decision rule, and conclusion. What assumptions did you make implicitly in
this test?



c. Fit third-order model (8.6) and test whether or not $\beta_{111} = 0$; use $\alpha = .05$. State the alternatives,
decision rule, and conclusion. Is your conclusion consistent with your finding in part (b)?

8.6. Steroid level. An endocrinologist was interested in exploring the relationship between the level
of a steroid (Y) and age (X) in healthy female subjects whose ages ranged from 8 to 25 years.
She collected a sample of 27 healthy females in this age range. The data are given below.

i: 1 2 3 ... 25 26 27
[The $X_i$ and $Y_i$ rows of this table are not recoverable from the source.]

a. Fit regression model (8.2). Plot the fitted regression function and the data. Does the quadratic
regression function appear to be a good fit here? Find $R^2$.

b. Test whether or not there is a regression relation; use $\alpha = .01$. State the alternatives, decision
rule, and conclusion. What is the P-value of the test?

c. Obtain joint interval estimates for the mean steroid level of females aged 10, 15, and 20,
respectively. Use the most efficient simultaneous estimation procedure and a 99 percent
family confidence coefficient. Interpret your intervals.

d. Predict the steroid levels of females aged 15 using a 99 percent prediction interval. Interpret
your interval.

e. Test whether the quadratic term can be dropped from the model; use $\alpha = .01$. State the
alternatives, decision rule, and conclusion.

f. Express the fitted regression function obtained in part (a) in terms of the original variable X.

8.7. Refer to Steroid level Problem 8.6.

a. Obtain the residuals and plot them against the fitted values and against x on separate graphs.
Also prepare a normal probability plot. What do your plots show?

b. Test formally for lack of fit. Control the risk of a Type I error at .01. State the alternatives,
decision rule, and conclusion. What assumptions did you make implicitly in this test?

8.8. Refer to Commercial properties Problems 6.18 and 7.7. The vacancy rate predictor ($X_3$) does
not appear to be needed when property age ($X_1$), operating expenses and taxes ($X_2$), and total
square footage ($X_4$) are included in the model as predictors of rental rates (Y).

a. The age of the property ($X_1$) appears to exhibit some curvature when plotted against the
rental rates (Y). Fit a polynomial regression model with centered property age ($x_1$),
the square of centered property age ($x_1^2$), operating expenses and taxes ($X_2$), and total
square footage ($X_4$). Plot the Y observations against the fitted values. Does the response
function provide a good fit?

b. Calculate $R^2$. What information does this measure provide?

c. Test whether or not the square of centered property age ($x_1^2$) can be dropped from the
model; use $\alpha = .05$. State the alternatives, decision rule, and conclusion. What is the P-value
of the test?

d. Estimate the mean rental rate when $X_1 = 8$, $X_2 = 16$, and $X_4 = 250{,}000$; use a 95 percent
confidence interval. Interpret your interval.

e. Express the fitted response function obtained in part (a) in the original X variables.

8.9. Consider the response function $E\{Y\} = 25 + 3X_1 + 4X_2 + 1.5X_1X_2$.

a. Prepare a conditional effects plot of the response function against $X_1$ when $X_2 = 3$ and
when $X_2 = 6$. How is the interaction effect of $X_1$ and $X_2$ on Y apparent from this graph?
Describe the nature of the interaction effect.

b. Plot a set of contour curves for the response surface. How is the interaction effect of $X_1$ and
$X_2$ on Y apparent from this graph?

8.10. Consider the response function $E\{Y\} = 14 + 7X_1 + 5X_2 - 4X_1X_2$.

a. Prepare a conditional effects plot of the response function against $X_2$ when $X_1 = 1$ and when
$X_1 = 4$. How does the graph indicate that the effects of $X_1$ and $X_2$ on Y are not additive?
What is the nature of the interaction effect?

b. Plot a set of contour curves for the response surface. How does the graph indicate that the
effects of $X_1$ and $X_2$ on Y are not additive?

8.11. Refer to Brand preference Problem 6.5. 

a. Fit regression model (8.22).

b. Test whether or not the interaction term can be dropped from the model; use $\alpha = .05$. State
the alternatives, decision rule, and conclusion.

8.12. A student who used a regression model that included indicator variables was upset when
receiving only the following output on the multiple regression printout: XTRANSPOSE X
SINGULAR. What is a likely source of the difficulty?

8.13. Refer to regression model (8.33). Portray graphically the response curves for this model if
$\beta_0 = 25.3$, $\beta_1 = .20$, and $\beta_2 = -12.1$.

8.14. In a regression study of factors affecting learning time for a certain task (measured in minutes),
gender of learner was included as a predictor variable ($X_2$) that was coded $X_2 = 1$ if male and
0 if female. It was found that $b_2 = 22.3$ and $s\{b_2\} = 3.8$. An observer questioned whether the
coding scheme for gender is fair because it results in a positive coefficient, leading to longer
learning times for males than females. Comment.

8.15. Refer to Copier maintenance Problem 1.20. The users of the copiers are either training
institutions that use a small model, or business firms that use a large, commercial model. An
analyst at Tri-City wishes to fit a regression model including both number of copiers serviced
($X_1$) and type of copier ($X_2$) as predictor variables and estimate the effect of copier model
(S: small, L: large) on number of minutes spent on the service call. Records show that the
models serviced in the 45 calls were:

i:       1  2  3  ...  43  44  45
X_i2:    S  L  L  ...  L   L   L

Assume that regression model (8.33) is appropriate, and let $X_2 = 1$ if small model and 0 if large,
commercial model.

a. Explain the meaning of all regression coefficients in the model.

b. Fit the regression model and state the estimated regression function.

c. Estimate the effect of copier model on mean service time with a 95 percent confidence
interval. Interpret your interval estimate.

d. Why would the analyst wish to include $X_1$, number of copiers, in the regression model when
interest is in estimating the effect of type of copier model on service time?

e. Obtain the residuals and plot them against $X_1X_2$. Is there any indication that an interaction
term in the regression model would be helpful?

8.16. Refer to Grade point average Problem 1.19. An assistant to the director of admissions
conjectured that the predictive power of the model could be improved by adding information on
whether the student had chosen a major field of concentration at the time the application was
submitted. Assume that regression model (8.33) is appropriate, where $X_1$ is entrance test score
and $X_2 = 1$ if student had indicated a major field of concentration at the time of application
and 0 if the major field was undecided. Data for $X_2$ were as follows:

i:       1  2  3  ...  118  119  120
X_i2:    0  1  0  ...  1    1    0

a. Explain how each regression coefficient in model (8.33) is interpreted here. 

b. Fit the regression model and state the estimated regression function. 

c. Test whether the X] variable can be dropped from the regression model; use a = . 01 . State 
the alternatives, decision rule, and conclusion. 

d. Obtain the residuals for regression model (8.33) and plot them against $X_1X_2$. Is there any
evidence in your plot that it would be helpful to include an interaction term in the model?

8.17. Refer to regression models (8.33) and (8.49). Would the conclusion that $\beta_2 = 0$ have the same
implication for each of these models? Explain.

8.18. Refer to regression model (8.49). Portray graphically the response curves for this model if
$\beta_0 = 25$, $\beta_1 = .30$, $\beta_2 = -12.5$, and $\beta_3 = .05$. Describe the nature of the interaction effect.
*8.19. Refer to Copier maintenance Problems 1.20 and 8.15. 

a. Fit regression model (8.49) and state the estimated regression function. 

b. Test whether the interaction term can be dropped from the model; control the $\alpha$ risk at .10.
State the alternatives, decision rule, and conclusion. What is the P-value of the test? If the
interaction term cannot be dropped from the model, describe the nature of the interaction
effect.

8.20. Refer to Grade point average Problems 1.19 and 8.16. 

a. Fit regression model (8.49) and state the estimated regression function. 

b. Test whether the interaction term can be dropped from the model; use a = .05. State the 
alternatives, decision rule, and conclusion. If the interaction term cannot be dropped from 
the model, describe the nature of the interaction effect. 

8.21. In a regression analysis of on-the-job head injuries of warehouse laborers caused by falling
objects, Y is a measure of severity of the injury, $X_1$ is an index reflecting both the weight of
the object and the distance it fell, and $X_2$ and $X_3$ are indicator variables for nature of head
protection worn at the time of the accident, coded as follows:

Type of Protection    X_2    X_3
Hard hat               1      0
Bump cap               0      1
None                   0      0

The response function to be used in the study is $E\{Y\} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3$.

a. Develop the response function for each type of protection category.

b. For each of the following questions, specify the alternatives $H_0$ and $H_a$ for the appropriate
test: (1) With $X_1$ fixed, does wearing a bump cap reduce the expected severity of injury as
compared with wearing no protection? (2) With $X_1$ fixed, is the expected severity of injury
the same when wearing a hard hat as when wearing a bump cap?

8.22. Refer to tool wear regression model (8.36). Suppose the indicator variables had been defined as 
follows: Xt = I if tool model M2 and 0 otherwise, = 1 if tool model M3 and 0 otherwise, 

= l if tool model M4 and () otherwise. Indicate the meaning of each of the following: (Oft. 
(2))04-^ (3))6,. i 





8.23. A marketing research trainee in the national office of a chain of shoe stores used the following
response function to study seasonal (winter, spring, summer, fall) effects on sales of a certain
line of shoes: $E\{Y\} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3$. The Xs are indicator variables defined as
follows: $X_1 = 1$ if winter and 0 otherwise, $X_2 = 1$ if spring and 0 otherwise, $X_3 = 1$ if fall and 0
otherwise. After fitting the model, the trainee tested the regression coefficients $\beta_k$ (k = 0, ..., 3)
and came to the following set of conclusions at an .05 family level of significance: $\beta_0 \ne 0$,
$\beta_1 = 0$, $\beta_2 \ne 0$, $\beta_3 \ne 0$. In the report the trainee then wrote: "Results of regression analysis
show that climatic and other seasonal factors have no influence in determining sales of this
shoe line in the winter. Seasonal influences do exist in the other seasons." Do you agree with
this interpretation of the test results? Discuss.

8.24. Assessed valuations. A tax consultant studied the current relation between selling price and
assessed valuation of one-family residential dwellings in a large tax district by obtaining data
for a random sample of 16 recent "arm's-length" sales transactions of one-family dwellings
located on corner lots and for a random sample of 48 recent sales of one-family dwellings not
located on corner lots. In the data that follow, both selling price (Y) and assessed valuation
($X_1$) are expressed in thousand dollars, whereas lot location ($X_2$) is coded 1 for corner lots
and 0 for non-corner lots.


i:       1      2      3     ...   62     63     64
X_i1:   76.4   74.3   69.6   ...  79.4   74.7   71.5
X_i2:    0      0      0     ...   0      0      1
Y_i:    78.8   73.8   64.6   ...  97.6   84.4   70.5


Assume that the error variances in the two populations are equal and that regression model (8.49) 
is appropriate. 

a. Plot the sample data for the two populations as a symbolic scatter plot. Does the regression 
relation appear to be the same for the two populations? 

b. Test for identity of the regression functions for dwellings on corner lots and dwellings in
other locations; control the risk of Type I error at .05. State the alternatives, decision rule,
and conclusion.

c. Plot the estimated regression functions for the two populations and describe the nature of 
the differences between them. 

8.25. Refer to Grocery retailer Problems 6.9 and 7.4. 

a. Fit regression model (8.58) using the number of cases shipped ($X_1$) and the binary variable
($X_3$) as predictors.

b. Test whether or not the interaction terms and the quadratic term can be dropped from the 
model; use a = .05. State the alternatives, decision rule, and conclusion. What is the P-value 
of the test? 

8.26. In time series analysis, the X variable representing time usually is defined to take on values 
1, 2, etc., for the successive time periods. Does this represent an allocated code when the time 
periods are actually 1989, 1990, etc.? 

8.27. An analyst wishes to include number of older siblings in family as a predictor variable in a
regression analysis of factors affecting maturation in eighth graders. The number of older siblings
in the sample observations ranges from 0 to 4. Discuss whether this variable should be placed
in the model as an ordinary quantitative variable or by means of four 0, 1 indicator variables.

8.28. Refer to regression model (8.31) for the insurance innovation study. Suppose $\beta_0$ were dropped
from the model to eliminate the linear dependence in the X matrix so that the model becomes
$Y_i = \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \varepsilon_i$. What is the meaning here of each of the regression
coefficients $\beta_1$, $\beta_2$, and $\beta_3$?




Exercises

8.29. Consider the second-order regression model with one predictor variable in (8.2) and the
following two sets of X values:

Set 1:
Set 2:

For each set, calculate the coefficient of correlation between X and $X^2$, then between x and $x^2$.
Also calculate the coefficients of correlation between X and $X^3$ and between x and $x^3$. What
generalizations are suggested by your results?

8.30. (Calculus needed.) Refer to second-order response function (8.3). Explain precisely the meaning
of the linear effect coefficient $\beta_1$ and the quadratic effect coefficient $\beta_{11}$.

8.31. a. Derive the expressions for $b_0'$, $b_1'$, and $b_{11}'$ in (8.12).

b. Using (5.46), obtain the variance-covariance matrix for the regression coefficients pertaining
to the original X variable in terms of the variance-covariance matrix for the regression
coefficients pertaining to the transformed x variable.

8.32. How are the normal equations (8.4) simplified if the X values are equally spaced, such as the
time series representation $X_1 = 1$, $X_2 = 2$, ..., $X_n = n$?

8.33. Refer to the instrument calibration study example in Section 8.7. Suppose that three instruments
(A, B, C) had been developed to identical specifications, that the regression functions relating
gauge reading (Y) to actual pressure ($X_1$) are second-order polynomials for each instrument,
that the error variances are the same, and that the polynomial coefficients may differ from one
instrument to the next. Let $X_3$ denote a second indicator variable, where $X_3 = 1$ if instrument
C and 0 otherwise.

a. Expand regression model (8.58) to cover this situation.

b. State the alternatives, define the test statistic, and give the decision rule for each of the
following tests when the level of significance is .01: (1) test whether the second-order
regression functions for the three instruments are identical, (2) test whether all three regression
functions have the same intercept, (3) test whether both the linear and quadratic effects are
the same in all three regression functions.

8.34. In a regression study, three types of banks were involved, namely, commercial, mutual savings,
and savings and loan. Consider the following system of indicator variables for type of bank:

Type of Bank         X_2    X_3
Commercial            1      0
Mutual savings        0      1
Savings and loan     -1     -1

a. Develop a first-order linear regression model for relating last year's profit or loss (Y) to size
of bank ($X_1$) and type of bank ($X_2$, $X_3$).

b. State the response functions for the three types of banks.

c. Interpret each of the following quantities: (1) $\beta_2$, (2) $\beta_3$, (3) $-\beta_2 - \beta_3$.

8.35. Refer to regression model (8.54) and exclude variable $X_3$.

a. Obtain the X'X matrix for this special case of a single qualitative predictor variable, for
i = 1, ..., n when $n_1$ firms are not incorporated.

b. Using (6.25), find b.

c. Using (6.35) and (6.36), find SSE and SSR.






Projects 


8.36. Refer to the CDI data set in Appendix C.2. It is desired to fit second-order regression model (8.2)
for relating number of active physicians (Y) to total population (X).

a. Fit the second-order regression model. Plot the residuals against the fitted values. How well 
does the second-order model appear to fit the data? 

b. Obtain R 2 for the second-order regression model. Also obtain the coefficient of simple 
determination for the first-order regression model. Has the addition of the quadratic term in 
the regression model substantially increased the coefficient of determination? 

c. Test whether the quadratic term can be dropped from the regression model; use a = .05. 
State the alternatives, decision rule, and conclusion. 

8.37. Refer to the CDI data set in Appendix C.2. A regression model relating serious crime rate
(Y, total serious crimes divided by total population) to population density ($X_1$, total population
divided by land area) and unemployment rate ($X_3$) is to be constructed.

a. Fit second-order regression model (8.8). Plot the residuals against the fitted values. How
well does the second-order model appear to fit the data? What is $R^2$?

b. Test whether or not all quadratic and interaction terms can be dropped from the regression
model; use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

c. Instead of the predictor variable population density, total population ($X_1$) and land area
($X_2$) are to be employed as separate predictor variables, in addition to unemployment rate
($X_3$). The regression model should contain linear and quadratic terms for total population,
and linear terms only for land area and unemployment rate. (No interaction terms are to be
included in this model.) Fit this regression model and obtain $R^2$. Is this coefficient of multiple
determination substantially different from the one for the regression model in part (a)?

8.38. Refer to the SENIC data set in Appendix C.1. Second-order regression model (8.2) is to be
fitted for relating number of nurses (Y) to available facilities and services (X).

a. Fit the second-order regression model. Plot the residuals against the fitted values. How well 
does the second-order model appear to fit the data? 

b. Obtain R 2 for the second-order regression model. Also obtain the coefficient of simple 
determination for the first-order regression model. Has the addition of the quadratic term in 
the regression model substantially increased the coefficient of determination? 

c. Test whether the quadratic term can be dropped from the regression model; use a = .01. 
State the alternatives, decision rule, and conclusion. 

8.39. Refer to the CDI data set in Appendix C.2. The number of active physicians (Y) is to be
regressed against total population ($X_1$), total personal income ($X_2$), and geographic region
($X_3$, $X_4$, $X_5$).

a. Fit a first-order regression model. Let $X_3 = 1$ if NE and 0 otherwise, $X_4 = 1$ if NC and 0
otherwise, and $X_5 = 1$ if S and 0 otherwise.

b. Examine whether the effect for the northeastern region on number of active physicians
differs from the effect for the north central region by constructing an appropriate 90 percent
confidence interval. Interpret your interval estimate.

c. Test whether any geographic effects are present; use $\alpha = .10$. State the alternatives, decision
rule, and conclusion. What is the P-value of the test?

8.40. Refer to the SENIC data set in Appendix C.1. Infection risk (Y) is to be regressed against length
of stay ($X_1$), age ($X_2$), routine chest X-ray ratio ($X_3$), and medical school affiliation ($X_4$).

a. Fit a first-order regression model. Let $X_4 = 1$ if hospital has medical school affiliation and
0 if not.

b. Estimate the effect of medical school affiliation on infection risk using a 98 percent
confidence interval. Interpret your interval estimate.

c. It has been suggested that the effect of medical school affiliation on infection risk may interact
with the effects of age and routine chest X-ray ratio. Add appropriate interaction terms to
the regression model, fit the revised regression model, and test whether the interaction terms
are helpful; use $\alpha = .10$. State the alternatives, decision rule, and conclusion.

8.41. Refer to the SENIC data set in Appendix C.1. Length of stay (Y) is to be regressed on age
($X_1$), routine culturing ratio ($X_2$), average daily census ($X_3$), available facilities and services
($X_4$), and region ($X_5$, $X_6$, $X_7$).

a. Fit a first-order regression model. Let $X_5 = 1$ if NE and 0 otherwise, $X_6 = 1$ if NC and 0
otherwise, and $X_7 = 1$ if S and 0 otherwise.

b. Test whether the routine culturing ratio can be dropped from the model; use a level of
significance of .05. State the alternatives, decision rule, and conclusion.

c. Examine whether the effect on length of stay for hospitals located in the western region differs
from that for hospitals located in the other three regions by constructing an appropriate
confidence interval for each pairwise comparison. Use the Bonferroni procedure with a
95 percent family confidence coefficient. Summarize your findings.

Case Study

8.42. Refer to Market share data set in Appendix C.3. Company executives want to be able to
predict market share of their product (Y) based on merchandise price ($X_1$); the gross Nielsen
rating points ($X_2$, an index of the amount of advertising exposure that the product received);
the presence or absence of a wholesale pricing discount ($X_3 = 1$ if discount present; otherwise
$X_3 = 0$); the presence or absence of a package promotion during the period ($X_4 = 1$ if promotion
present; otherwise $X_4 = 0$); and year ($X_5$). Code year as a nominal level variable and use 2000
as the referent year.

a. Fit a first-order regression model. Plot the residuals against the fitted values. How well does
the first-order model appear to fit the data?

b. Re-fit the model in part (a), after adding all second-order terms involving only the quantitative
predictors. Test whether or not all quadratic and interaction terms can be dropped from the
regression model; use $\alpha = .05$. State the alternatives, decision rule, and conclusion.

c. In part (a), test whether advertising index ($X_2$) and year ($X_5$) can be dropped from the model;
use $\alpha = .05$. State the alternatives, decision rule, and conclusion.


8.43. Refer to University admissions data set in Appendix C.4. The director of admissions at a state
university wished to determine how accurately students' grade-point averages at the end of their
freshman year (Y) can be predicted from the entrance examination (ACT) test score ($X_2$); the
high school class rank ($X_1$, a percentile where 99 indicates student is at or near the top of his
or her class and 1 indicates student is at or near the bottom of the class); and the academic year
($X_3$). The academic year variable covers the years 1996 through 2000. Develop a prediction
model for the director of admissions. Justify your choice of model. Assess your model's ability
to predict and discuss its use as a tool for admissions decisions.




Chapter 9


Building the Regression
Model I: Model Selection
and Validation


In earlier chapters, we considered how to fit simple and multiple regression models and how
to make inferences from these models. In this chapter, we first present an overview of the
model-building and model-validation process. Then we consider in more detail some special
issues in the selection of the predictor variables for exploratory observational studies. We
conclude the chapter with a detailed description of methods for validating regression models.

9.1 Overview of Model-Building Process

At the risk of oversimplifying, we present in Figure 9.1 a strategy for the building of a 
regression model. This strategy involves three or, sometimes, four phases: 

1. Data collection and preparation 

2. Reduction of explanatory or predictor variables (for exploratory observational studies) 

3. Model refinement and selection 

4. Model validation 

We consider each of these phases in turn. 

Data Collection 

The data collection requirements for building a regression model vary with the nature of 
the study. It is useful to distinguish four types of studies. 

Controlled Experiments. In a controlled experiment, the experimenter controls the
levels of the explanatory variables and assigns a treatment, consisting of a combination
of levels of the explanatory variables, to each experimental unit and observes the response.
For example, an experimenter studied the effects of the size of a graphic presentation and
the time allowed for analysis on the accuracy with which the analysis of the presentation is
carried out. Here, the response variable is a measure of the accuracy of the analysis, and the
explanatory variables are the size of the graphic presentation and the time allowed. Junior
executives were used as the experimental units. A treatment consisted of a particular
combination of size of presentation and length of time allowed. In controlled experiments, the
explanatory variables are often called factors or control variables.

FIGURE 9.1 Strategy for Building a Regression Model.
[Flowchart. Data collection and preparation: collect data; preliminary checks on data quality;
diagnostics for relationships and strong interactions; remedial measures if needed. Reduction of
number of explanatory variables (for exploratory observational studies): determine several
potentially useful subsets of explanatory variables, including known essential variables. Model
refinement and selection: investigate curvature and interaction effects more fully; study
residuals and other diagnostics; remedial measures if needed. Model validation.]

The data collection requirements for controlled experiments are straightforward, though
not necessarily simple. Observations for each experimental unit are needed on the response
variable and on the level of each of the control variables used for that experimental unit.
There may be difficult measurement and scaling problems for the response variable that are
unique to the area of application.

Controlled Experiments with Covariates. Statistical design of experiments uses
supplemental information, such as characteristics of the experimental units, in designing the
experiment so as to reduce the variance of the experimental error terms in the regression
model. Sometimes, however, it is not possible to incorporate this supplemental information
into the design of the experiment. Instead, it may be possible for the experimenter to
incorporate this information into the regression model and thereby reduce the error variance by
including uncontrolled variables or covariates in the model.

In our previous example involving the accuracy of analysis of graphic presentations,
the experimenter suspected that gender and number of years of education could affect the
accuracy responses in important ways. Because of time constraints, the experimenter was
able to use only a completely randomized design, which does not incorporate any
supplemental information into the design. The experimenter therefore also collected data on two
uncontrolled variables (gender and number of years of education of the junior executives)
in case the use of these covariates in the regression model would make the analysis of
the effects of the explanatory variables (size of graphic presentation, time allowed) on the
accuracy response more precise.

Confirmatory Observational Studies. These studies, based on observational, not
experimental, data, are intended to test (i.e., to confirm or not to confirm) hypotheses derived from
previous studies or from hunches. For these studies, data are collected for explanatory
variables that previous studies have shown to affect the response variable, as well as for the new
variable or variables involved in the hypothesis. In this context, the explanatory variable(s)
involved in the hypothesis are sometimes called the primary variables, and the explanatory
variables that are included to reflect existing knowledge are called the control variables
(known risk factors in epidemiology). The control variables here are not controlled as in
an experimental study, but they are used to account for known influences on the response
variable. For example, in an observational study of the effect of vitamin E supplements
on the occurrence of a certain type of cancer, known risk factors, such as age, gender, and
race, would be included as control variables and the amount of vitamin E supplements
taken daily would be the primary explanatory variable. The response variable would be the
occurrence of the particular type of cancer during the period under consideration. (The use
of qualitative response variables in a regression model will be considered in Chapter 14.)

Data collection for confirmatory observational studies involves obtaining observations on
the response variable, the control variables, and the primary explanatory variable(s). Here, as
in controlled experiments, there may be important and complex problems of measurement,
such as how to obtain reliable data on the amount of vitamin supplements taken daily.

Exploratory Observational Studies. In the social, behavioral, and health sciences,
management, and other fields, it is often not possible to conduct controlled experiments.
Furthermore, adequate knowledge for conducting confirmatory observational studies may
be lacking. As a result, many studies in these fields are exploratory observational studies
where investigators search for explanatory variables that might be related to the response
variable. To complicate matters further, any available theoretical models may involve
explanatory variables that are not directly measurable, such as a family's future earnings over
the next 10 years. Under these conditions, investigators are often forced to prospect for
explanatory variables that could conceivably be related to the response variable under study.
Obviously, such a set of potentially useful explanatory variables can be large. For
example, a company's sales of portable dishwashers in a district may be affected by population
size, per capita income, percent of population in urban areas, percent of population under
50 years of age, percent of families with children at home, etc., etc.!

After a lengthy list of potentially useful explanatory variables has been compiled, some
of these variables can be quickly screened out. An explanatory variable (1) may not be
fundamental to the problem, (2) may be subject to large measurement errors, and/or (3) may
effectively duplicate another explanatory variable in the list. Explanatory variables that
cannot be measured may either be deleted or replaced by proxy variables that are highly
correlated with them.


The number of cases to be collected for an exploratory observational regression study 
depends on the size of the pool of potentially useful explanatory variables available at this 
stage. More cases are required when the pool is large than when it is small. A general rule 
of thumb states that there should be at least 6 to 10 cases for every variable in the pool. 
The actual data collection for the pool of potentially useful explanatory variables and for 
the response variable again may involve important issues of measurement, just as for the 
other types of studies. 


Data Preparation 

Once the data have been collected, edit checks should be performed and plots prepared
to identify gross data errors as well as extreme outliers. Difficulties with data errors are
especially prevalent in large data sets and should be corrected or resolved before the model
building begins. Whenever possible, the investigator should carefully monitor and control
the data collection process to reduce the likelihood of data errors.

Preliminary Model Investigation 

Once the data have been properly edited, the formal modeling process can begin. A
variety of diagnostics should be employed to identify (1) the functional forms in which the
explanatory variables should enter the regression model and (2) important interactions that
should be included in the model. Scatter plots and residual plots are useful for determining
relationships and their strengths. Selected explanatory variables can be fitted in regression
functions to explore relationships, possible strong interactions, and the need for
transformations. Whenever possible, of course, one should also rely on the investigator's prior
knowledge and expertise to suggest appropriate transformations and interactions to
investigate. This is particularly important when the number of potentially useful explanatory
variables is large. In this case, it may be very difficult to investigate all possible
pairwise interactions, and prior knowledge should be used to identify the important ones. The
diagnostic procedures explained in previous chapters and in Chapter 10 should be used as
resources in this phase of model building.





Reduction of Explanatory Variables 

Controlled Experiments. The reduction of explanatory variables in the model-building
phase is usually not an important issue for controlled experiments. The experimenter has
chosen the explanatory variables for investigation, and a regression model is to be
developed that will enable the investigator to study the effects of these variables on the response
variable. After the model has been developed, including the use of appropriate functional
forms for the variables and the inclusion of important interaction terms, the inferential
procedures considered in previous chapters will be used to determine whether the explanatory
variables have effects on the response variable and, if so, the nature and magnitude of the effects.

Controlled Experiments with Covariates. In studies of controlled experiments with
covariates, some reduction of the covariates may take place because investigators often
cannot be sure in advance that the selected covariates will be helpful in reducing the error
variance. For instance, the investigator in our graphic presentation example may wish to
examine at this stage of the model-building process whether gender and number of years
of education are related to the accuracy response, as had been anticipated. If not, the
investigator would wish to drop them as not being helpful in reducing the model error
variance and, therefore, in the analysis of the effects of the explanatory variables on the
response variable. The number of covariates considered in controlled experiments is usually
small, so no special problems are encountered in determining whether some or all of the
covariates should be dropped from the regression model.

Confirmatory Observational Studies. Generally, no reduction of explanatory variables 
should take place in confirmatory observational studies. The control variables were chosen 
on the basis of prior knowledge and should be retained for comparison with earlier studies 
even if some of the control variables turn out not to lead to any error variance reduction 
in the study at hand. The primary variables are the ones whose influence on the response 
variable is to be examined and therefore need to be present in the model. 

Exploratory Observational Studies. In exploratory observational studies, the number of
explanatory variables that remain after the initial screening typically is still large. Further,
many of these variables frequently will be highly intercorrelated. Hence, the investigator
usually will wish to reduce the number of explanatory variables to be used in the final
model. There are several reasons for this. A regression model with numerous explanatory
variables may be difficult to maintain. Further, regression models with a limited number of
explanatory variables are easier to work with and understand. Finally, the presence of many
highly intercorrelated explanatory variables may substantially increase the sampling
variation of the regression coefficients, detract from the model's descriptive abilities, increase
the problem of roundoff errors (as noted in Chapter 7), and not improve, or even worsen,
the model's predictive ability. An actual worsening of the model's predictive ability can
occur when explanatory variables are kept in the regression model that are not related to
the response variable, given the other explanatory variables in the model. In that case, the
variances of the fitted values $\sigma^2\{\hat{Y}_i\}$ tend to become larger with the inclusion of the useless
additional explanatory variables.
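The growth in the variances of the fitted values can be made concrete with a standard least squares identity, offered here as a sketch rather than as part of the original discussion. It assumes uncorrelated error terms with constant variance $\sigma^2$ and a full-rank $X$ matrix, and it introduces $h_{ii}$ (notation assumed here) for the $i$th diagonal element of the hat matrix $H = X(X'X)^{-1}X'$:

```latex
\sigma^2\{\hat{Y}_i\} = \sigma^2 h_{ii},
\qquad
\sum_{i=1}^{n} h_{ii} = p
\quad\Longrightarrow\quad
\frac{1}{n}\sum_{i=1}^{n} \sigma^2\{\hat{Y}_i\} = \frac{p\,\sigma^2}{n}
```

Under these assumptions, each added X variable raises $p$ by one, so the average variance of the fitted values increases whether or not the added variable is related to the response.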

Hence, once the investigator has tentatively decided upon the functional form of the
regression relations (whether given variables are to appear in linear form, quadratic form,
etc.) and whether any interaction terms are to be included, the next step in many exploratory
observational studies is to identify a few "good" subsets of X variables for further
intensive study. These subsets should include not only the potential explanatory variables in
first-order form but also any needed quadratic and other curvature terms and any necessary
interaction terms.

The identification of "good" subsets of potentially useful explanatory variables to be
included in the final regression model and the determination of appropriate functional
and interaction relations for these variables usually constitute some of the most difficult
problems in regression analysis. Since the uses of regression models vary, no one subset of
explanatory variables may always be "best." For instance, a descriptive use of a regression
model typically will emphasize precise estimation of the regression coefficients, whereas
a predictive use will focus on the prediction errors. Often, different subsets of the pool of
potential explanatory variables will best serve these varying purposes. Even for a given
purpose, it is often found that several subsets are about equally "good" according to a given
criterion, and the choice among these "good" subsets needs to be made on the basis of
additional considerations.

The choice of a few appropriate subsets of explanatory variables for final consideration
in exploratory observational studies needs to be done with great care. Elimination of key
explanatory variables can seriously damage the explanatory power of the model and lead
to biased estimates of regression coefficients, mean responses, and predictions of new
observations, as well as biased estimates of the error variance. The bias in these estimates is
related to the fact that with observational data, the error terms in an underfitted regression
model may reflect nonrandom effects of the explanatory variables not incorporated in the
regression model. Important omitted explanatory variables are sometimes called latent
explanatory variables.

On the other hand, if too many explanatory variables are included in the subset, then this 
overfitted model will often result in variances of estimated parameters that are larger than 
those for simpler models. 

Another danger with observational data is that important explanatory variables may be 
observed only over narrow ranges. As a result, such important explanatory variables may 
be omitted just because they occur in the sample within a narrow range of values and 
therefore turn out to be statistically nonsignificant. 

Another consideration in identifying subsets of explanatory variables is that these subsets 
need to be small enough so that maintenance costs are manageable and analysis is facilitated, 
yet large enough so that adequate description, control, or prediction is possible. 

A variety of computerized approaches have been developed to assist the investigator
in reducing the number of potential explanatory variables in an exploratory observational
study when these variables are correlated among themselves. We present two of these
approaches in this chapter. The first, which is practical for pools of explanatory variables
that are small or moderate in size, considers all possible subsets of explanatory variables
that can be developed from the pool of potential explanatory variables and identifies those
subsets that are "good" according to a criterion specified by the investigator. The second
approach employs automatic search procedures to arrive at a single subset of the explanatory
variables. This approach is recommended primarily for reductions involving large pools of
explanatory variables.

Even though computerized approaches can be very helpful in identifying appropriate
subsets for detailed, final consideration, the process of developing a useful regression model
must be pragmatic and needs to utilize large doses of subjective judgment. Explanatory
variables that are considered essential should be included in the regression model before
any computerized assistance is sought. Further, computerized approaches that identify only
a single subset of explanatory variables as "best" need to be supplemented so that additional
subsets are also considered before the final regression model is decided upon.



Comments 

1. All too often, unwary investigators will screen a set of explanatory variables by fitting the
regression model containing the entire set of potential X variables and then simply dropping those
for which the $t^*$ statistic (7.25):

$$t_k^* = \frac{b_k}{s\{b_k\}}$$

has a small absolute value. As we know from Chapter 7, this procedure can lead to the dropping
of important intercorrelated explanatory variables. Clearly, a good search procedure must be able
to handle important intercorrelated explanatory variables in such a way that not all of them will be
dropped.

2. Controlled experiments can usually avoid many of the problems in exploratory observational
studies. For example, the effects of latent predictor variables are minimized by using randomization.
In addition, adequate ranges of the explanatory variables can be selected and correlations among the
explanatory variables can be eliminated by appropriate choices of their levels.


Model Refinement and Selection 

At this stage in the model-building process, the tentative regression model, or the several 
“good” regression models in the case of exploratory observational studies, need to be 
checked in detail for curvature and interaction effects. Residual plots are helpful in deciding 
whether one model is to be preferred over another. In addition, the diagnostic checks to 
be described in Chapter 10 are useful for identifying influential outlying observations, 
multicollinearity, etc. 

The selection of the ultimate regression model often depends greatly upon these
diagnostic results. For example, one fitted model may be very much influenced by a single case,
whereas another is not. Again, one fitted model may show correlations among the error
terms, whereas another does not.

When repeat observations are available, formal tests for lack of fit can be made. In 
any case, a variety of residual plots and analyses can be employed to identify any lack of 
fit, outliers, and influential observations. For instance, residual plots against cross-product 
and/or power terms not included in the regression model can be useful in identifying ways 
in which the model fit can be improved further.

When an automatic selection procedure is utilized for an exploratory observational study
and only a single model is identified as "best," other models should also be explored. One
procedure is to use the number of explanatory variables in the model identified as "best" as
an estimate of the number of explanatory variables needed in the regression model. Then
the investigator explores and identifies other candidate models with approximately the same
number of explanatory variables identified by the automatic procedure.

Eventually, after thorough checking and various remedial actions, such as
transformations, the investigator narrows the number of competing models to one or just a few. At this
point, it is good statistical practice to assess the validity of the remaining candidates through
model validation studies. These methods can be used to help decide upon a final regression
model, and to determine how well the model will perform in practice.



Model Validation

Model validity refers to the stability and reasonableness of the regression coefficients, the
plausibility and usability of the regression function, and the ability to generalize
inferences drawn from the regression analysis. Validation is a useful and necessary part of the
model-building process. Several methods of assessing model validity will be described in
Section 9.6.

9.2 Surgical Unit Example

With the completion of this overview of the model-building process for a regression study,
we next present an example that will be used to illustrate all stages of this process as they
are taken up in this and the following two chapters. A hospital surgical unit was interested
in predicting survival in patients undergoing a particular type of liver operation. A random
selection of 108 patients was available for analysis. From each patient record, the following
information was extracted from the preoperation evaluation:

$X_1$   blood clotting score
$X_2$   prognostic index
$X_3$   enzyme function test score
$X_4$   liver function test score
$X_5$   age, in years
$X_6$   indicator variable for gender (0 = male, 1 = female)
$X_7$ and $X_8$   indicator variables for history of alcohol use:

    Alcohol Use    $X_7$    $X_8$
    None             0        0
    Moderate         1        0
    Severe           0        1

These constitute the pool of potential explanatory or predictor variables for a predictive
regression model. The response variable is survival time, which was ascertained in a
follow-up study. A portion of the data on the potential predictor variables and the response variable is
presented in Table 9.1. These data have already been screened and properly edited for errors.
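The indicator coding for history of alcohol use can be sketched in a few lines of Python. This is only an illustration of the coding scheme in the list of variables; the names `ALCOHOL_CODES` and `encode_alcohol_use` are hypothetical and do not come from the text.

```python
# Hypothetical helper illustrating the (X7, X8) indicator coding for alcohol use.
# "None" is the reference category: both indicators are 0.
ALCOHOL_CODES = {
    "None": (0, 0),
    "Moderate": (1, 0),
    "Severe": (0, 1),
}


def encode_alcohol_use(level):
    """Return the (X7, X8) indicator pair for a history-of-alcohol-use category."""
    try:
        return ALCOHOL_CODES[level]
    except KeyError:
        raise ValueError(f"unknown alcohol-use level: {level!r}") from None


x7, x8 = encode_alcohol_use("Moderate")
print(x7, x8)
```

Two indicator variables suffice for three categories because the reference category is identified by both indicators being zero.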

TABLE 9.1 Potential Predictor Variables and Response Variable: Surgical Unit Example.

Columns: $i$ = case number; $X_{i1}$ = blood-clotting score; $X_{i2}$ = prognostic index;
$X_{i3}$ = enzyme test; $X_{i4}$ = liver test; $X_{i5}$ = age; $X_{i6}$ = gender;
$X_{i7}$ = alcohol use, moderate; $X_{i8}$ = alcohol use, severe; $Y_i$ = survival time;
$Y_i' = \ln Y_i$.

  i    X_i1   X_i2   X_i3   X_i4   X_i5   X_i6   X_i7   X_i8    Y_i    Y'_i
  1     6.7     62     81   2.59     50      0      1      0    695   6.544
  2     5.1     59     66   1.70     39      0      0      0    403   5.999
  3     7.4     57     83   2.16     55      0      0      0    710   6.565
 ...
 52     6.4     85     40   1.21     58      0      0      1    579   6.361
 53     6.4     59     85   2.33     63      0      1      0    550   6.310
 54     8.8     78     72   3.20     56      0      0      0    651   6.478


FIGURE 9.2 Some Preliminary Residual Plots: Surgical Unit Example.
[Plots not reproduced. Panels: (a) Residual Plot for Y; (b) Normal Plot for Y; (c) Residual Plot
for lnY; (d) Normal Plot for lnY. Panels (a) and (c) plot residuals against predicted values;
panels (b) and (d) plot residuals against expected values.]


To illustrate the model-building procedures discussed in this and the next section, we will
use only the first four explanatory variables. By limiting the number of potential explanatory
variables, we can explain the procedures without overwhelming the reader with masses of
computer printouts. We will also use only the first 54 of the 108 patients.

Since the pool of predictor variables is small, a reasonably full exploration of
relationships and of possible strong interaction effects is possible at this stage of data preparation.
Stem-and-leaf plots were prepared for each of the predictor variables (not shown). These
highlighted several cases as outlying with respect to the explanatory variables. The
investigator was thereby alerted to examine later the influence of these cases. A scatter plot matrix
and the correlation matrix were also obtained (not shown).

A first-order regression model based on all predictor variables was fitted to serve as a 
starting point. A plot of residuals against predicted values for this fitted model is shown 
in Figure 9.2a. The plot suggests that both curvature and nonconstant error variance are 
apparent. In addition, some departure from normality is suggested by the normal probability 
plot of residuals in Figure 9.2b. 

To make the distribution of the error terms more nearly normal and to see if the same
transformation would also reduce the apparent curvature, the investigator examined the
logarithmic transformation $Y' = \ln Y$. Data for the transformed response variable are also
given in Table 9.1. Figure 9.2c shows a plot of residuals against fitted values when $Y'$ is
regressed on all four predictor variables in a first-order model; also, the normal probability
plot of residuals for the transformed data in Figure 9.2d shows that the distribution of the
error terms is more nearly normal.

FIGURE 9.3 JMP Scatter Plot Matrix and Correlation Matrix when Response Variable Is $Y'$:
Surgical Unit Example.

Multivariate Correlations

              LnSurvival   Bloodclot   Progindex     Enzyme      Liver
LnSurvival        1.0000      0.2462      0.4699     0.6539     0.6493
Bloodclot         0.2462      1.0000      0.0901    -0.1496     0.5024
Progindex         0.4699      0.0901      1.0000    -0.0236     0.3690
Enzyme            0.6539     -0.1496     -0.0236     1.0000     0.4164
Liver             0.6493      0.5024      0.3690     0.4164     1.0000

[Scatterplot Matrix of LnSurvival, Bloodclot, Progindex, Enzyme, and Liver; plot not reproduced.]

The investigator also obtained a scatter plot matrix and the correlation matrix with the
transformed Y variable; these are presented in Figure 9.3. In addition, various scatter and
residual plots were obtained (not shown here). All of these plots indicate that each of the
predictor variables is linearly associated with $Y'$, with $X_3$ and $X_4$ showing the highest
degrees of association and $X_1$ the lowest. The scatter plot matrix and the correlation matrix
further show intercorrelations among the potential predictor variables. In particular, $X_4$ has
moderately high pairwise correlations with $X_1$, $X_2$, and $X_3$.

On the basis of these analyses, the investigator decided to use, at this stage of the
model-building process, $Y' = \ln Y$ as the response variable, to represent the predictor
variables in linear terms, and not to include any interaction terms. The next stage in the
model-building process is to examine whether all of the potential predictor variables are needed
or whether a subset of them is adequate. A number of useful measures have been
developed to assess the adequacy of the various subsets. We now turn to a discussion of these
measures.
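As a quick check on the transformed response $Y' = \ln Y$ adopted for the surgical unit example, the following sketch recomputes the natural logarithms of the first three survival times listed in Table 9.1. The variable names are illustrative only.

```python
import math

# Survival times Y_i for cases 1-3, as listed in Table 9.1.
survival_times = {1: 695, 2: 403, 3: 710}

# Transformed response Y'_i = ln(Y_i), rounded to three decimals as in the table.
log_survival = {case: round(math.log(y), 3) for case, y in survival_times.items()}

for case, value in log_survival.items():
    print(case, survival_times[case], value)
```

The rounded values agree with the $Y'$ column of Table 9.1 for these three cases.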


9.3 Criteria for Model Selection

From any set of $p - 1$ predictors, $2^{p-1}$ alternative models can be constructed. This
calculation is based on the fact that each predictor can be either included or excluded from the
model. For example, the $2^4 = 16$ different possible subset models that can be formed from
the pool of four X variables in the surgical unit example are listed in Table 9.2. First, there
is the regression model with no X variables, i.e., the model $Y_i = \beta_0 + \varepsilon_i$. Then there are
the regression models with one X variable ($X_1$, $X_2$, $X_3$, $X_4$), with two X variables ($X_1$ and
$X_2$, $X_1$ and $X_3$, $X_1$ and $X_4$, $X_2$ and $X_3$, $X_2$ and $X_4$, $X_3$ and $X_4$), and so on.
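The count of possible subset models can be checked by direct enumeration. The sketch below uses only the Python standard library; the function name `all_subsets` is illustrative and not taken from the text.

```python
from itertools import combinations


def all_subsets(predictors):
    """Yield every subset of the predictor pool, from the empty set to the full pool."""
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


pool = ["X1", "X2", "X3", "X4"]
models = list(all_subsets(pool))

print(len(models))  # one model per subset, including the model with no X variables
for subset in models:
    print(subset if subset else "(no X variables)")
```

Each predictor is either in or out of a given model, which is why a pool of four X variables yields $2^4$ subsets.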


TABLE 9.2 $SSE_p$, $R^2_p$, $R^2_{a,p}$, $C_p$, $AIC_p$, $SBC_p$, and $PRESS_p$ Values for All Possible Regression
Models: Surgical Unit Example.

X Variables         (1)     (2)      (3)      (4)        (5)        (6)        (7)       (8)
in Model             p     SSE_p    R^2_p   R^2_a,p      C_p       AIC_p      SBC_p    PRESS_p
None                 1    12.808    0.000    0.000     151.498    -75.703    -73.714    13.296
X1                   2    12.031    0.061    0.043     141.164    -77.079    -73.101    13.512
X2                   2     9.979    0.221    0.206     108.556    -87.178    -83.200    10.744
X3                   2     7.332    0.428    0.417      66.489   -103.827    -99.849     8.327
X4                   2     7.409    0.422    0.410      67.715   -103.262    -99.284     8.025
X1, X2               3     9.443    0.263    0.234     102.031    -88.162    -82.195    11.062
X1, X3               3     5.781    0.549    0.531      43.852   -114.658   -108.691     6.988
X1, X4               3     7.299    0.430    0.408      67.972   -102.067    -96.100     8.472
X2, X3               3     4.312    0.663    0.650      20.520   -130.483   -124.516     5.065
X2, X4               3     6.622    0.483    0.463      57.215   -107.324   -101.357     7.476
X3, X4               3     5.130    0.599    0.584      33.504   -121.113   -115.146     6.121
X1, X2, X3           4     3.109    0.757    0.743       3.391   -146.161   -138.205     3.914
X1, X2, X4           4     6.570    0.487    0.456      58.392   -105.748    -97.792     7.903
X1, X3, X4           4     4.968    0.612    0.589      32.932   -120.844   -112.888     6.207
X2, X3, X4           4     3.614    0.718    0.701      11.424   -138.023   -130.067     4.597
X1, X2, X3, X4       5     3.084    0.759    0.740       5.000   -144.590   -134.645     4.069




In most circumstances, it will be impossible for an analyst to make a detailed examination
of all possible regression models. For instance, when there are 10 potential X variables in
the pool, there would be $2^{10} = 1,024$ possible regression models. With the availability of
high-speed computers and efficient algorithms, running all possible regression models for
10 potential X variables is not time consuming. Still, the sheer volume of 1,024 alternative
models to examine carefully would be an overwhelming task for a data analyst.

Model selection procedures, also known as subset selection or variables selection
procedures, have been developed to identify a small group of regression models that are "good"
according to a specified criterion. A detailed examination can then be made of a limited
number of the more promising or "candidate" models, leading to the selection of the final
regression model to be employed. This limited number might consist of three to six "good"
subsets according to the criteria specified, so the investigator can then carefully study these
regression models for choosing the final model.

While many criteria for comparing the regression models have been developed, we will
focus on six: $R^2_p$, $R^2_{a,p}$, $C_p$, $AIC_p$, $SBC_p$, and $PRESS_p$. Before doing so, we will need to
develop some notation. We shall denote the number of potential X variables in the pool by
$P - 1$. We assume throughout this chapter that all regression models contain an intercept
term $\beta_0$. Hence, the regression function containing all potential X variables contains $P$
parameters, and the function with no X variables contains one parameter ($\beta_0$).

The number of X variables in a subset will be denoted by $p - 1$, as always, so that there
are $p$ parameters in the regression function for this subset of X variables. Thus, we have:

$$1 \le p \le P \qquad (9.1)$$

We will assume that the number of observations exceeds the maximum number of
potential parameters:

$$n > P \qquad (9.2)$$

and, indeed, it is highly desirable that $n$ be substantially larger than $P$, as we noted earlier,
so that sound results can be obtained.


$R^2_p$ or $SSE_p$ Criterion

The $R^2_p$ criterion calls for the use of the coefficient of multiple determination $R^2$, defined
in (6.40), in order to identify several “good” subsets of X variables, in other words, subsets
for which $R^2$ is high. We show the number of parameters in the regression model as a
subscript of $R^2$. Thus $R^2_p$ indicates that there are $p$ parameters, or $p - 1$ X variables, in the
regression function on which $R^2_p$ is based.

The $R^2_p$ criterion is equivalent to using the error sum of squares $SSE_p$ as the criterion
(we again show the number of parameters in the regression model as a subscript). With the
$SSE_p$ criterion, subsets for which $SSE_p$ is small are considered “good.” The equivalence of
the $R^2_p$ and $SSE_p$ criteria follows from (6.40):

$$R^2_p = 1 - \frac{SSE_p}{SSTO} \tag{9.3}$$

Since the denominator SSTO is constant for all possible regression models, $R^2_p$ varies
inversely with $SSE_p$.



The $R^2_p$ criterion is not intended to identify the subsets that maximize this criterion.
We know that $R^2_p$ can never decrease as additional X variables are included in the model.
Hence, $R^2_p$ will be a maximum when all $P - 1$ potential X variables are included in the
regression model. The intent in using the $R^2_p$ criterion is to find the point where adding more
X variables is not worthwhile because it leads to a very small increase in $R^2_p$. Often, this
point is reached when only a limited number of X variables is included in the regression
model. Clearly, the determination of where diminishing returns set in is a judgmental one.

Example

Table 9.2 for the surgical unit example shows in columns 1 and 2 the number of parameters
in the regression function and the error sum of squares for each possible regression model.
In column 3 are given the $R^2_p$ values. The results were obtained from a series of computer
runs. For instance, when $X_4$ is the only X variable in the regression model, we obtain:

$$R^2_2 = 1 - \frac{SSE(X_4)}{SSTO} = 1 - \frac{7.409}{12.808} = .422$$

Note that $SSTO = SSE_1 = 12.808$.
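The arithmetic of (9.3) can be reproduced in a few lines. This is a small sketch using the sums of squares quoted for the surgical unit example; the function name `r2_p` is an illustrative choice.

```python
def r2_p(sse_p, ssto):
    """Coefficient of multiple determination, eq. (9.3)."""
    return 1.0 - sse_p / ssto

SSTO = 12.808        # SSE of the intercept-only model (p = 1)
SSE_X4 = 7.409       # SSE when X4 is the only predictor (p = 2)
SSE_FULL = 3.084     # SSE with X1, X2, X3, X4 (p = 5), from Table 9.2

print(f"{r2_p(SSE_X4, SSTO):.3f}")    # 0.422
print(f"{r2_p(SSE_FULL, SSTO):.3f}")  # 0.759
```

Because SSTO is the same for every subset, ranking models by $R^2_p$ and ranking them by $SSE_p$ give the same order, reversed.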

Figure 9.4a contains a plot of the $R^2_p$ values against $p$, the number of parameters in the
regression model. The maximum $R^2_p$ value for the possible subsets each consisting of $p - 1$
predictor variables, denoted by $\max(R^2_p)$, appears at the top of the graph for each $p$. These
points are connected by solid lines to show the impact of adding additional X variables.
Figure 9.4a makes it clear that little increase in $\max(R^2_p)$ takes place after three X variables
are included in the model. Hence, consideration of the subsets $(X_1, X_2, X_3)$ for which
$R^2_4 = .757$ (as shown in column 3 of Table 9.2) and $(X_2, X_3, X_4)$ for which $R^2_4 = .718$
appears to be reasonable according to the $R^2_p$ criterion.

Note that variables $X_3$ and $X_4$ correlate most highly with the response variable, yet this
pair does not appear together in the $\max(R^2_p)$ model for $p = 4$. This suggests that $X_1$, $X_2$,
and $X_3$ contain much of the information presented by $X_4$. Note also that the coefficient
of multiple determination associated with subset $(X_2, X_3, X_4)$, $R^2_4 = .718$, is somewhat
smaller than $R^2_4 = .757$ for subset $(X_1, X_2, X_3)$.


$R^2_{a,p}$ or $MSE_p$ Criterion

Since $R^2_p$ does not take account of the number of parameters in the regression model
and since $\max(R^2_p)$ can never decrease as $p$ increases, the adjusted coefficient of multiple
determination $R^2_{a,p}$ in (6.42) has been suggested as an alternative criterion:

$$R^2_{a,p} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE_p}{SSTO} = 1 - \frac{MSE_p}{SSTO/(n-1)} \tag{9.4}$$


This coefficient takes the number of parameters in the regression model into account through
the degrees of freedom. It can be seen from (9.4) that $R^2_{a,p}$ increases if and only if $MSE_p$
decreases since $SSTO/(n - 1)$ is fixed for the given Y observations. Hence, $R^2_{a,p}$ and $MSE_p$
provide equivalent information. We shall consider here the criterion $R^2_{a,p}$, again showing
the number of parameters in the regression model as a subscript of the criterion. The largest
$R^2_{a,p}$ for a given number of parameters in the model, $\max(R^2_{a,p})$, can, indeed, decrease as
$p$ increases. This occurs when the increase in $\max(R^2_p)$ becomes so small that it is not
sufficient to offset the loss of an additional degree of freedom. Users of the $R^2_{a,p}$ criterion
seek to find a few subsets for which $R^2_{a,p}$ is at the maximum or so close to the maximum
that adding more variables is not worthwhile.

Example

The $R^2_{a,p}$ values for all possible regression models for the surgical unit example are shown
in Table 9.2, column 4. For instance, we have for the regression model containing only $X_4$:

$$R^2_{a,2} = 1 - \left(\frac{n-1}{n-2}\right)\frac{SSE(X_4)}{SSTO} = 1 - \left(\frac{53}{52}\right)\frac{7.409}{12.808} = .410$$


Figure 9.4b contains the $R^2_{a,p}$ plot for the surgical unit example. We have again connected
the $\max(R^2_{a,p})$ values by solid lines. The story told by the $R^2_{a,p}$ plot in Figure 9.4b is very
similar to that told by the $R^2_p$ plot in Figure 9.4a. Consideration of the subsets $(X_1, X_2, X_3)$
and $(X_2, X_3, X_4)$ appears to be reasonable according to the $R^2_{a,p}$ criterion. Notice that
$R^2_{a,4} = .743$ is maximized for subset $(X_1, X_2, X_3)$, and that adding $X_4$ to this subset (thus
using all four predictors) decreases the criterion slightly: $R^2_{a,5} = .740$.
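The degrees-of-freedom adjustment in (9.4) is easy to verify numerically. The sketch below uses only the sums of squares quoted in the text; the function name `adj_r2_p` is an illustrative assumption.

```python
def adj_r2_p(sse_p, ssto, n, p):
    """Adjusted coefficient of multiple determination, eq. (9.4)."""
    return 1.0 - ((n - 1) / (n - p)) * (sse_p / ssto)

n, SSTO = 54, 12.808
ra2_x4 = adj_r2_p(7.409, SSTO, n, p=2)     # X4 only
ra2_full = adj_r2_p(3.084, SSTO, n, p=5)   # X1, X2, X3, X4

print(f"{ra2_x4:.3f} {ra2_full:.3f}")      # 0.410 0.740
```

The factor $(n-1)/(n-p)$ grows with $p$, so a new variable raises $R^2_{a,p}$ only if it reduces $SSE_p$ by more than that factor costs.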


FIGURE 9.4 Plot of Variables Selection Criteria: Surgical Unit Example.







Mallows' $C_p$ Criterion

This criterion is concerned with the total mean squared error of the $n$ fitted values for each
subset regression model. The mean squared error concept involves the total error in each
fitted value:

$$\hat{Y}_i - \mu_i \tag{9.5}$$

where $\mu_i$ is the true mean response when the levels of the predictor variables $X_k$ are those
for the $i$th case. This total error is made up of a bias component and a random error
component:

1. The bias component for the $i$th fitted value $\hat{Y}_i$, also called the model error component,
is:

$$E\{\hat{Y}_i\} - \mu_i \tag{9.5a}$$

where $E\{\hat{Y}_i\}$ is the expectation of the $i$th fitted value for the given regression model. If
the fitted model is not correct, $E\{\hat{Y}_i\}$ will differ from the true mean response $\mu_i$ and the
difference represents the bias of the fitted model.

2. The random error component for $\hat{Y}_i$ is:

$$\hat{Y}_i - E\{\hat{Y}_i\} \tag{9.5b}$$

This component represents the deviation of the fitted value $\hat{Y}_i$ for the given sample from the
expected value when the $i$th fitted value is obtained by fitting the same regression model to
all possible samples.

The mean squared error for $\hat{Y}_i$ is defined as the expected value of the square of the total
error in (9.5), in other words, the expected value of:

$$(\hat{Y}_i - \mu_i)^2 = [(E\{\hat{Y}_i\} - \mu_i) + (\hat{Y}_i - E\{\hat{Y}_i\})]^2$$

It can be shown that this expected value is:

$$(E\{\hat{Y}_i\} - \mu_i)^2 + \sigma^2\{\hat{Y}_i\} \tag{9.6}$$

where $\sigma^2\{\hat{Y}_i\}$ is the variance of the fitted value $\hat{Y}_i$. We see from (9.6) that the mean squared
error for the fitted value $\hat{Y}_i$ is the sum of the squared bias and the variance of $\hat{Y}_i$.
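The step from the squared total error to (9.6) can be written out as follows. This is a short sketch of the standard argument: the bias term is a constant, and the random error component has expectation zero, so the cross-product term vanishes.

```latex
\begin{align*}
E\{(\hat{Y}_i - \mu_i)^2\}
  &= E\{[(E\{\hat{Y}_i\} - \mu_i) + (\hat{Y}_i - E\{\hat{Y}_i\})]^2\} \\
  &= (E\{\hat{Y}_i\} - \mu_i)^2
     + 2(E\{\hat{Y}_i\} - \mu_i)\,E\{\hat{Y}_i - E\{\hat{Y}_i\}\}
     + E\{(\hat{Y}_i - E\{\hat{Y}_i\})^2\} \\
  &= (E\{\hat{Y}_i\} - \mu_i)^2 + 0 + \sigma^2\{\hat{Y}_i\}
\end{align*}
```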

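The total mean squared error for all $n$ fitted values $\hat{Y}_i$ is the sum of the $n$ individual mean
squared errors in (9.6):

$$\sum_{i=1}^{n} [(E\{\hat{Y}_i\} - \mu_i)^2 + \sigma^2\{\hat{Y}_i\}] = \sum_{i=1}^{n} (E\{\hat{Y}_i\} - \mu_i)^2 + \sum_{i=1}^{n} \sigma^2\{\hat{Y}_i\} \tag{9.7}$$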



The criterion measure, denoted by $\Gamma_p$, is simply the total mean squared error in (9.7) divided
by $\sigma^2$, the true error variance:

$$\Gamma_p = \frac{1}{\sigma^2}\left[\sum_{i=1}^{n} (E\{\hat{Y}_i\} - \mu_i)^2 + \sum_{i=1}^{n} \sigma^2\{\hat{Y}_i\}\right] \tag{9.8}$$

The model which includes all $P - 1$ potential X variables is assumed to have been
carefully chosen so that $MSE(X_1, \ldots, X_{P-1})$ is an unbiased estimator of $\sigma^2$. It can then be
shown that an estimator of $\Gamma_p$ is $C_p$:

$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p) \tag{9.9}$$

where $SSE_p$ is the error sum of squares for the fitted subset regression model with $p$
parameters (i.e., with $p - 1$ X variables).

When there is no bias in the regression model with $p - 1$ X variables so that $E\{\hat{Y}_i\} \equiv \mu_i$,
the expected value of $C_p$ is approximately $p$:

$$E\{C_p\} \approx p \quad \text{when } E\{\hat{Y}_i\} \equiv \mu_i \tag{9.10}$$

Thus, when the $C_p$ values for all possible regression models are plotted against $p$, those
models with little bias will tend to fall near the line $C_p = p$. Models with substantial bias will
tend to fall considerably above this line. $C_p$ values below the line $C_p = p$ are interpreted as
showing no bias, being below the line due to sampling error. The $C_p$ value for the regression
model containing all $P - 1$ X variables is, by definition, $P$. The $C_p$ measure assumes that
$MSE(X_1, \ldots, X_{P-1})$ is an unbiased estimator of $\sigma^2$, which is equivalent to assuming that
this model contains no bias.

In using the $C_p$ criterion, we seek to identify subsets of X variables for which (1) the
$C_p$ value is small and (2) the $C_p$ value is near $p$. Subsets with small $C_p$ values have a small
total mean squared error, and when the $C_p$ value is also near $p$, the bias of the regression
model is small. It may sometimes occur that the regression model based on a subset of X
variables with a small $C_p$ value involves substantial bias. In that case, one may at times
prefer a regression model based on a somewhat larger subset of X variables for which the
$C_p$ value is only slightly larger but which does not involve a substantial bias component.
Reference 9.1 contains extended discussions of applications of the $C_p$ criterion.

Example

Table 9.2, column 5, contains the $C_p$ values for all possible regression models for the surgical
unit example. For instance, when $X_4$ is the only X variable in the regression model, the $C_p$
value is:

$$C_2 = \frac{SSE(X_4)}{\dfrac{SSE(X_1, X_2, X_3, X_4)}{n - 5}} - [n - 2(2)] = \frac{7.409}{\dfrac{3.084}{49}} - [54 - 2(2)] = 67.715$$

The $C_p$ values for all possible regression models are plotted in Figure 9.4c. We find that
$C_p$ is minimized for subset $(X_1, X_2, X_3)$. Notice that $C_p = 3.391 < p = 4$ for this model,
indicating little or no bias in the regression model.
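The calculation in (9.9) can be sketched as below, using only the rounded sums of squares quoted in the text. The function name `mallows_cp` is an illustrative assumption. With rounded inputs the first result differs from the reported 67.715 in the third decimal.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' C_p, eq. (9.9)."""
    return sse_p / mse_full - (n - 2 * p)

n, P = 54, 5
SSE_FULL = 3.084                  # SSE(X1, X2, X3, X4)
mse_full = SSE_FULL / (n - P)     # MSE of the model with all P - 1 predictors

cp_x4 = mallows_cp(7.409, mse_full, n, p=2)
cp_full = mallows_cp(SSE_FULL, mse_full, n, p=P)

print(f"{cp_x4:.2f}")    # 67.72, far above p = 2, signalling substantial bias
print(f"{cp_full:.2f}")  # 5.00, i.e. C_P = P for the full model
```

The second line shows why the full model cannot be judged by $C_p$: its value equals $P$ by construction, whatever the data.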






Note that use of all potential X variables $(X_1, X_2, X_3, X_4)$ results in a $C_p$ value of exactly
$P$, as expected; here, $C_5 = 5.00$. Also note that use of subset $(X_2, X_3, X_4)$ with $C_p$ value
$C_4 = 11.424$ would be poor because of the substantial bias with this model. Thus, the $C_p$
criterion suggests only one subset $(X_1, X_2, X_3)$ for the surgical unit example.


Comments

1. Effective use of the $C_p$ criterion requires careful development of the pool of $P - 1$ potential X variables,
with the predictor variables expressed in appropriate form (linear, quadratic, transformed),
and important interactions included, so that $MSE(X_1, \ldots, X_{P-1})$ provides an unbiased estimate
of the error variance $\sigma^2$.

2. The $C_p$ criterion places major emphasis on the fit of the subset model for the $n$ sample observations.
At times, a modification of the $C_p$ criterion that emphasizes new observations to be predicted may
be preferable.

3. To see why $C_p$ as defined in (9.9) is an estimator of $\Gamma_p$, we need to utilize two results that we shall
simply state. First, it can be shown that:

$$\sum_{i=1}^{n} \sigma^2\{\hat{Y}_i\} = p\sigma^2 \tag{9.11}$$

Thus, the total random error of the $n$ fitted values $\hat{Y}_i$ increases as the number of variables in the
regression model increases.

Further, it can be shown that:

$$E\{SSE_p\} = \sum_{i=1}^{n} (E\{\hat{Y}_i\} - \mu_i)^2 + (n - p)\sigma^2 \tag{9.12}$$

Hence, $\Gamma_p$ in (9.8) can be expressed as follows:

$$\Gamma_p = \frac{1}{\sigma^2}\left[E\{SSE_p\} - (n - p)\sigma^2 + p\sigma^2\right] = \frac{E\{SSE_p\}}{\sigma^2} - (n - 2p) \tag{9.13}$$

Replacing $E\{SSE_p\}$ by the estimator $SSE_p$ and using $MSE(X_1, \ldots, X_{P-1})$ as an estimator of $\sigma^2$
yields $C_p$ in (9.9).

4. To show that the $C_p$ value for the regression model containing all $P - 1$ X variables is $P$, we
substitute in (9.9), as follows:

$$C_P = \frac{SSE(X_1, \ldots, X_{P-1})}{\dfrac{SSE(X_1, \ldots, X_{P-1})}{n - P}} - (n - 2P) = (n - P) - (n - 2P) = P$$


$AIC_p$ and $SBC_p$ Criteria

We have seen that both $R^2_{a,p}$ and $C_p$ are model selection criteria that penalize models
having large numbers of predictors. Two popular alternatives that also provide penalties
for adding predictors are Akaike’s information criterion ($AIC_p$) and Schwarz’ Bayesian
criterion ($SBC_p$). We search for models that have small values of $AIC_p$ or $SBC_p$, where
these criteria are given by:

$$AIC_p = n \ln SSE_p - n \ln n + 2p \tag{9.14}$$

$$SBC_p = n \ln SSE_p - n \ln n + [\ln n]p \tag{9.15}$$

Notice that for both of these measures, the first term is $n \ln SSE_p$, which decreases as $p$
increases. The second term is fixed (for a given sample size $n$), and the third term increases
with the number of parameters, $p$. Models with small $SSE_p$ will do well by these criteria
as long as the penalties, $2p$ for $AIC_p$ and $[\ln n]p$ for $SBC_p$, are not too large. If $n \ge 8$
the penalty for $SBC_p$ is larger than that for $AIC_p$; hence the $SBC_p$ criterion tends to favor
more parsimonious models.


Example 


Table 9.2, columns 6 and 7, contains the $AIC_p$ and $SBC_p$ values for all possible regression
models for the surgical unit example. When $X_4$ is the only X variable in the regression
model, the $AIC_p$ value is:

$$AIC_2 = n \ln SSE_2 - n \ln n + 2p = 54 \ln 7.409 - 54 \ln 54 + 2(2) = -103.262$$

Similarly, the $SBC_p$ value is:

$$SBC_2 = n \ln SSE_2 - n \ln n + [\ln n]p = 54 \ln 7.409 - 54 \ln 54 + [\ln 54](2) = -99.284$$

The $AIC_p$ and $SBC_p$ values for all possible regression models are plotted in Figures 9.4d
and e. We find that both of these criteria are minimized for subset $(X_1, X_2, X_3)$.
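Formulas (9.14) and (9.15) can be evaluated directly. The sketch below uses the rounded $SSE_2 = 7.409$, so its results agree with the reported values only to about two decimals. The function names are illustrative assumptions.

```python
import math

def aic_p(sse_p, n, p):
    """Akaike's information criterion, eq. (9.14)."""
    return n * math.log(sse_p) - n * math.log(n) + 2 * p

def sbc_p(sse_p, n, p):
    """Schwarz' Bayesian criterion, eq. (9.15)."""
    return n * math.log(sse_p) - n * math.log(n) + math.log(n) * p

n = 54
print(f"{aic_p(7.409, n, 2):.2f}")   # -103.26
print(f"{sbc_p(7.409, n, 2):.2f}")   # -99.28

# The per-parameter penalty ln(n) exceeds 2 once n >= 8.
print(math.log(7) > 2, math.log(8) > 2)   # False True
```

The last line is the reason for the threshold $n \ge 8$: $SBC_p$ charges $\ln n$ per parameter against the fixed charge of 2 in $AIC_p$.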


$PRESS_p$ Criterion

The $PRESS_p$ (prediction sum of squares) criterion is a measure of how well the use of the
fitted values for a subset model can predict the observed responses $Y_i$. The error sum of
squares, $SSE = \sum (Y_i - \hat{Y}_i)^2$, is also such a measure. The PRESS measure differs from SSE
in that each fitted value $\hat{Y}_i$ for the PRESS criterion is obtained by deleting the $i$th case from
the data set, estimating the regression function for the subset model from the remaining
$n - 1$ cases, and then using the fitted regression function to obtain the predicted value $\hat{Y}_{i(i)}$
for the $i$th case. We use the notation $\hat{Y}_{i(i)}$ now for the fitted value to indicate, by the first
subscript $i$, that it is a predicted value for the $i$th case and, by the second subscript $(i)$, that
the $i$th case was omitted when the regression function was fitted.

The PRESS prediction error for the $i$th case then is:

$$Y_i - \hat{Y}_{i(i)} \tag{9.16}$$

and the $PRESS_p$ criterion is the sum of the squared prediction errors over all $n$ cases:

$$PRESS_p = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i(i)})^2 \tag{9.17}$$

Models with small $PRESS_p$ values are considered good candidate models. The reason is
that when the prediction errors $Y_i - \hat{Y}_{i(i)}$ are small, so are the squared prediction errors and
the sum of the squared prediction errors. Thus, models with small $PRESS_p$ values fit well
in the sense of having small prediction errors.



$PRESS_p$ values can be calculated without requiring $n$ separate regression runs, each time
deleting one of the $n$ cases. The relationship in (10.21) and (10.21a), to be explained in the
next chapter, enables one to calculate all $\hat{Y}_{i(i)}$ values from a single regression run.
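The two routes to $PRESS_p$ can be compared on a toy problem. The sketch below assumes the single-run relationship is the standard deleted-residual identity $Y_i - \hat{Y}_{i(i)} = e_i/(1 - h_{ii})$, with $h_{ii}$ the leverage of case $i$; the text defers the details to the next chapter. It is restricted to simple linear regression so that it needs no libraries, and the data and function names are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def press_by_deletion(x, y):
    """PRESS from n separate fits, each omitting one case (eqs. 9.16-9.17)."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

def press_single_fit(x, y):
    """PRESS from one fit, using the deleted residual e_i / (1 - h_ii)."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)
        h = 1.0 / n + (xi - xbar) ** 2 / sxx   # leverage in simple regression
        total += (e / (1.0 - h)) ** 2
    return total

# Made-up illustrative data (not from the surgical unit example).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
print(abs(press_by_deletion(x, y) - press_single_fit(x, y)) < 1e-9)  # True
```

The single-fit route costs one regression instead of $n$, which matters when PRESS is evaluated for every candidate subset.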

Example

Table 9.2, column 8, contains the $PRESS_p$ values for all possible regression models for the
surgical unit example. The $PRESS_p$ values are plotted in Figure 9.4f. The message given
by the $PRESS_p$ values in Table 9.2 and the plot in Figure 9.4f is very similar to that told by
the other criteria. We find that subsets $(X_1, X_2, X_3)$ and $(X_2, X_3, X_4)$ have small PRESS
values; in fact, the set of all X variables $(X_1, X_2, X_3, X_4)$ involves a slightly larger PRESS
value than subset $(X_1, X_2, X_3)$. The subset $(X_2, X_3, X_4)$ involves a PRESS value of 4.597,
which is moderately larger than the PRESS value of 3.914 for subset $(X_1, X_2, X_3)$.

Comment 

PRESS values can also be useful for model validation, as will be explained in Section 9.6.


9.4 Automatic Search Procedures for Model Selection

As noted in the previous section, the number of possible models, $2^{P-1}$, grows rapidly with
the number of predictors. Evaluating all of the possible alternatives can be a daunting
endeavor. To simplify the task, a variety of automatic computer-search procedures have
been developed. In this section, we will review the two most common approaches, namely
“best” subsets regression and stepwise regression.

For the remainder of this chapter, we will employ the full set of eight predictors from 
the surgical unit data. Recall that these predictors are displayed in Table 9.1 on page 350 
and described there as well. 


"Best" Subsets Algorithms 

Time-saving algorithms have been developed in which the best subsets according to a 
specified criterion are identified without requiring the fitting of all of the possible subset 
regression models. In fact, these algorithms require the calculation of only a small fraction 
of all possible regression models. For instance, if the C p criterion is to be employed and the 
five best subsets according to this criterion are to be identified, these algorithms search for 
the five subsets of X variables with the smallest C p values using much less computational 
effort than when all possible subsets are evaluated. These algorithms are called “best” 
subsets algorithms. Not only do these algorithms provide the best subsets according to the 
specified criterion, but they often also identify several “good” subsets for each possible 
number of X variables in the model to give the investigator additional helpful information 
in making the final selection of the subset of X variables to be employed in the regression 
model.

When the pool of potential X variables is very large, say greater than 30 or 40, even
the “best” subset algorithms may require excessive computer time. Under these conditions,
one of the stepwise regression procedures, described later in this section, may need to be
employed to assist in the selection of X variables.
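For contrast with the time-saving algorithms described above, here is the brute-force baseline they are designed to avoid: fit every subset and rank by $C_p$. This is a sketch only, not a “best” subsets algorithm. The helper `fit_sse`, the function `best_subsets_by_cp`, and the small synthetic data set are all illustrative assumptions.

```python
from itertools import combinations

def fit_sse(columns, y):
    """SSE from least squares of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[j] * r[m] for r in rows) for m in range(k)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(k)]
    for c in range(k):                       # Gauss-Jordan with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    b = [a[j][k] / a[j][j] for j in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))

def best_subsets_by_cp(data, y, keep=3):
    """Exhaustively rank all subsets by Mallows' C_p (smallest first)."""
    names = sorted(data)
    n, P = len(y), len(names) + 1
    mse_full = fit_sse([data[v] for v in names], y) / (n - P)
    ranked = []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1
            cp = fit_sse([data[v] for v in subset], y) / mse_full - (n - 2 * p)
            ranked.append((cp, subset))
    return sorted(ranked)[:keep]

# Synthetic 2^3 factorial data (hypothetical, not the surgical unit data).
data = {
    "X1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "X2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "X3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [6.0, 6.0, 8.0, 8.0, 11.0, 11.0, 15.0, 15.0]

for cp, subset in best_subsets_by_cp(data, y):
    print(subset, round(cp, 3))
```

In this toy data set the response depends on X1 and X2 only, so the subset (X1, X2) ranks first. The loop visits all $2^{P-1}$ subsets, which is the cost that grows out of reach for large pools.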



For the eight predictors in the surgical unit example, we know there are $2^8 = 256$ possible
models. Plots of the six model selection criteria discussed in this chapter are displayed in
Figure 9.5. The best values of each criterion for each $p$ have been connected with solid
lines. These best values are also displayed in Table 9.3. The overall optimum criterion values
have been underlined in each column of the table. Notice that the choice of a “best” model
depends on the criterion. For example, a seven- or eight-parameter model is identified as
best by the $R^2_{a,p}$ criterion (both have $\max(R^2_{a,p}) = .823$), a six-parameter model is identified
by the $C_p$ criterion ($\min(C_6) = 5.541$), and a seven-parameter model is identified by the
$AIC_p$ criterion ($\min(AIC_7) = -163.834$). As is frequently the case, the $SBC_p$ criterion
identifies a more parsimonious model as best. In this case both the $SBC_p$ and $PRESS_p$ criteria
point to five-parameter models ($\min(SBC_5) = -153.406$ and $\min(PRESS_5) = 2.738$). As
previously emphasized, our objective at this stage is not to identify a single best model; we
hope to identify a small set of promising models for further study.

FIGURE 9.5 Plot of Variable Selection Criteria with All Eight Predictors: Surgical Unit Example.
[Six panels, (a) through (f), one per criterion, each plotted against $p$.]
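The way each criterion picks its own optimum can be checked against the best values per $p$ from Table 9.3. The following is a minimal sketch; the dictionary layout and the function name `optimal_p` are illustrative assumptions, and $R^2_p$ is left out because it is always largest for the full model.

```python
# Best criterion values for each p = 1, ..., 9, transcribed from Table 9.3.
p_values = list(range(1, 10))
table = {
    "Ra2":   [0.000, 0.417, 0.650, 0.765, 0.816, 0.821, 0.823, 0.823, 0.819],
    "Cp":    [240.452, 117.409, 50.472, 18.914, 5.751, 5.541, 5.787, 7.029, 9.000],
    "AIC":   [-75.703, -103.827, -130.483, -150.985, -163.351, -163.805,
              -163.834, -162.736, -160.771],
    # SBC at p = 8 is recomputed from (9.15) with SSE_8 = 1.972.
    "SBC":   [-73.714, -99.849, -124.516, -143.029, -153.406, -151.871,
              -149.911, -146.824, -142.870],
    "PRESS": [13.296, 8.025, 5.065, 3.469, 2.738, 2.739, 2.772, 2.809, 2.931],
}

def optimal_p(name, values):
    """Return every p at which the criterion attains its optimum (ties kept)."""
    best = max(values) if name == "Ra2" else min(values)  # only Ra2 is maximized
    return [p for p, v in zip(p_values, values) if v == best]

for name, values in table.items():
    print(name, optimal_p(name, values))
# Ra2 [7, 8]
# Cp [6]
# AIC [7]
# SBC [5]
# PRESS [5]
```

The spread from five to eight parameters across criteria is the point of the example: the criteria narrow the field rather than name a single winner.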

Figure 9.6 contains, for the surgical unit example, MINITAB output for the “best” subsets
algorithm. Here, we specified that the best two subsets be identified for each number of
variables in the regression model. The MINITAB algorithm uses the $R^2_p$ criterion, but also
shows for each of the “best” subsets the $R^2_{a,p}$, $C_p$, and $\sqrt{MSE_p}$ (labeled S) values. The
right-most columns of the tabulation show the X variables in the subset. From the figure
it is seen that the best subset, according to the $R^2_{a,p}$ criterion, is either the seven-parameter
model based on all predictors except Liver ($X_4$) and Histmod (history of moderate alcohol
use, $X_7$), or the eight-parameter model based on all predictors except Liver ($X_4$). The $R^2_{a,p}$
criterion value for both of these models is .823.

FIGURE 9.6 MINITAB Output for “Best” Two Subsets for Each Subset Size: Surgical Unit Example.

Response is lnSurviv

Vars   R-Sq   R-Sq(adj)     C-p         S
   1   42.8        41.7   117.4   0.37549
   1   42.2        41.0   119.2   0.37746
   2   66.3        65.0    50.5   0.29079
   2   59.9        58.4    69.1   0.31715
   3   77.8        76.5    18.9   0.23845
   3   75.7        74.3    25.0   0.24934
   4   83.0        81.6     5.8   0.21087
   4   81.4        79.9    10.3   0.22023
   5   83.7        82.1     5.5   0.20827
   5   83.6        81.9     6.0   0.20931
   6   84.3        82.3     5.8   0.20655
   6   83.9        81.9     7.0   0.20934
   7   84.6        82.3     7.0   0.20705
   7   84.4        82.0     7.7   0.20867
   8   84.6        81.9     9.0   0.20927

[Predictor-indicator columns (Bloodclo, Proginde, Enzyme, Liver, Age, Gender, Histmod, Histheav) omitted.]

The all-possible-regressions procedure leads to the identification of a small number of
subsets that are “good” according to a specified criterion. In the surgical unit example, two
of the criteria, $SBC_p$ and $PRESS_p$, pointed to models with four predictors, while the
other criteria favored larger models. Consequently, one may wish at times to consider more
than one criterion in evaluating possible subsets of X variables.


TABLE 9.3 Best Variable-Selection Criterion Values: Surgical Unit Example.

        (1)       (2)        (3)          (4)        (5)         (6)        (7)
 p    SSE_p     R^2_p   R^2_{a,p}       C_p       AIC_p       SBC_p    PRESS_p
 1   12.808     0.000      0.000     240.452     -75.703     -73.714    13.296
 2    7.332     0.428      0.417     117.409    -103.827     -99.849     8.025
 3    4.312     0.663      0.650      50.472    -130.483    -124.516     5.065
 4    2.843     0.778      0.765      18.914    -150.985    -143.029     3.469
 5    2.179     0.830      0.816       5.751    -163.351    -153.406     2.738
 6    2.082     0.837      0.821       5.541    -163.805    -151.871     2.739
 7    2.005     0.843      0.823       5.787    -163.834    -149.911     2.772
 8    1.972     0.846      0.823       7.029    -162.736    -146.824     2.809
 9    1.971     0.846      0.819       9.000    -160.771    -142.870     2.931




Once the investigator has identified a few “good” subsets for intensive examination, a
final choice of the model variables must be made. This choice, as indicated by our
model-building strategy in Figure 9.1, is aided by residual analyses (and other diagnostics to be
covered in Chapter 10) and by the investigator’s knowledge of the subject under study, and
is finally confirmed through model validation studies.

Stepwise Regression Methods 

In those occasional cases when the pool of potential X variables contains 30 to 40 or even
more variables, use of a “best” subsets algorithm may not be feasible. An automatic search
procedure that develops the “best” subset of X variables sequentially may then be helpful.
The forward stepwise regression procedure is probably the most widely used of the automatic
search methods. It was developed to economize on computational efforts, as compared with
the various all-possible-regressions procedures. Essentially, this search method develops a
sequence of regression models, at each step adding or deleting an X variable. The criterion
for adding or deleting an X variable can be stated equivalently in terms of error sum of
squares reduction, coefficient of partial correlation, $t^*$ statistic, or $F^*$ statistic.

An essential difference between stepwise procedures and the “best” subsets algorithm is
that stepwise search procedures end with the identification of a single regression model as
“best.” With the “best” subsets algorithm, on the other hand, several regression models can
be identified as “good” for final consideration. The identification of a single regression model
as “best” by the stepwise procedures is a major weakness of these procedures. Experience
has shown that each of the stepwise search procedures can sometimes err by identifying a
suboptimal regression model as “best.” In addition, the identification of a single regression
model may hide the fact that several other regression models may also be “good.” Finally,
the “goodness” of a regression model can only be established by a thorough examination
using a variety of diagnostics.

What then can we do on those occasions when the pool of potential X variables is very
large and an automatic search procedure must be utilized? Basically, we should use the
subset identified by the automatic search procedure as a starting point for searching for 
other “good” subsets. One possibility is to treat the number of X variables in the regression 
model identified by the automatic search procedure as being about the right subset size and 
then use the “best” subsets procedure for subsets of this and nearby sizes. 

Forward Stepwise Regression 

We shall describe the forward stepwise regression search algorithm in terms of the $t^*$
statistics (2.17) and their associated P-values for the usual tests of regression parameters.

1. The stepwise regression routine first fits a simple linear regression model for each of
the $P - 1$ potential X variables. For each simple linear regression model, the $t^*$ statistic (2.17)
for testing whether or not the slope is zero is obtained:

$$t_k^* = \frac{b_k}{s\{b_k\}} \tag{9.18}$$

The X variable with the largest $t^*$ value (in absolute value) is the candidate for first addition.
If this $t^*$ value exceeds a predetermined level, or if the corresponding P-value is less than a
predetermined $\alpha$, the X variable is added. Otherwise, the program terminates with no X variable
considered sufficiently helpful to enter the regression model. Since the degrees of freedom
associated with MSE vary depending on the number of X variables in the model, and since
repeated tests on the same data are undertaken, fixed $t^*$ limits for adding or deleting a
variable have no precise probabilistic meaning. For this reason, software programs often
favor the use of predetermined $\alpha$-limits.

2. Assume $X_7$ is the variable entered at step 1. The stepwise regression routine now
fits all regression models with two X variables, where $X_7$ is one of the pair. For each
such regression model, the $t^*$ test statistic corresponding to the newly added predictor $X_k$
is obtained. This is the statistic for testing whether or not $\beta_k = 0$ when $X_7$ and $X_k$ are
the variables in the model. The X variable with the largest $t^*$ value (or equivalently, the
smallest P-value) is the candidate for addition at the second stage. If this $t^*$ value exceeds
a predetermined level (i.e., the P-value falls below a predetermined level), the second X
variable is added. Otherwise, the program terminates.

3. Suppose $X_3$ is added at the second stage. Now the stepwise regression routine examines
whether any of the other X variables already in the model should be dropped. For our
illustration, there is at this stage only one other X variable in the model, $X_7$, so that only
one $t^*$ test statistic is obtained:

$$t_7^* = \frac{b_7}{s\{b_7\}} \tag{9.19}$$

At later stages, there would be a number of these $t^*$ statistics, one for each of the variables
in the model besides the one last added. The variable for which this $t^*$ value is smallest (or
equivalently the variable for which the P-value is largest) is the candidate for deletion. If
this $t^*$ value falls below a predetermined limit (or the P-value exceeds one), the variable is
dropped from the model; otherwise, it is retained.

4. Suppose $X_7$ is retained so that both $X_3$ and $X_7$ are now in the model. The stepwise
regression routine now examines which X variable is the next candidate for addition, then
examines whether any of the variables already in the model should now be dropped, and
so on until no further X variables can either be added or deleted, at which point the search
terminates.


Note that the stepwise regression algorithm allows an X variable, brought into the model
at an earlier stage, to be dropped subsequently if it is no longer helpful in conjunction with
variables added at later stages.
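The four steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the routine of any particular package: it uses fixed limits on $F^* = (t^*)^2$ rather than P-values (the standard library has no t distribution), the limits 4.0 and 3.9 are arbitrary, the names `fit_sse`, `partial_f`, and `forward_stepwise` are hypothetical, and the data are synthetic. It also assumes no candidate model fits perfectly, since a zero $SSE$ would divide by zero.

```python
def fit_sse(columns, y):
    """SSE from least squares of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[j] * r[m] for r in rows) for m in range(k)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(k)]
    for c in range(k):                       # Gauss-Jordan with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    b = [a[j][k] / a[j][j] for j in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))

def partial_f(data, y, model, var):
    """F* = (t*)^2 for `var`, given the other variables in `model`."""
    n, p = len(y), len(model) + 1
    sse_full = fit_sse([data[v] for v in model], y)
    sse_reduced = fit_sse([data[v] for v in model if v != var], y)
    return (sse_reduced - sse_full) / (sse_full / (n - p))

def forward_stepwise(data, y, f_enter=4.0, f_remove=3.9):
    """Forward stepwise search with add and drop tests on F* = (t*)^2."""
    model = []
    while True:
        candidates = [v for v in sorted(data) if v not in model]
        if not candidates:
            return model
        # Candidate for addition: largest F* when added to the current model.
        best = max(candidates, key=lambda v: partial_f(data, y, model + [v], v))
        if partial_f(data, y, model + [best], best) <= f_enter:
            return model
        model.append(best)
        # Candidate for deletion: smallest F* among variables entered earlier.
        earlier = model[:-1]
        if earlier:
            weakest = min(earlier, key=lambda v: partial_f(data, y, model, v))
            if partial_f(data, y, model, weakest) < f_remove:
                model.remove(weakest)

# Synthetic 2^3 factorial data (hypothetical, not the surgical unit data).
data = {
    "X1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "X2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "X3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [6.0, 6.0, 8.0, 8.0, 11.0, 11.0, 15.0, 15.0]
print(forward_stepwise(data, y))   # ['X1', 'X2']
```

In this toy data set X2 does not pass the entry limit on its own but does once X1 is in the model, which illustrates why each test is conditional on the variables already present. Keeping the entry limit at or above the removal limit mirrors the requirement that a variable just admitted not be removed at once.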



Figure 9.7 shows the MINITAB computer printout for the forward stepwise regression procedure
for the surgical unit example. The maximum acceptable $\alpha$ limit for adding a variable
is 0.10 and the minimum acceptable $\alpha$ limit for removing a variable is 0.15, as shown at the
top of Figure 9.7.

We now follow through the steps.


1. At the start of the stepwise search, no X variable is in the model so that the model
to be fitted is $Y_i = \beta_0 + \varepsilon_i$. In step 1, the $t^*$ statistics (9.18) and corresponding P-values
are calculated for each potential X variable, and the predictor having the smallest P-value
(largest $t^*$ value) is chosen to enter the equation. We see that Enzyme ($X_3$) had the largest
test statistic:

$$t_3^* = \frac{b_3}{s\{b_3\}} = \frac{b_3}{.002427} = 6.23$$

The P-value for this test statistic is 0.000, which falls below the maximum acceptable
$\alpha$-to-enter value of 0.10; hence Enzyme ($X_3$) is added to the model.

FIGURE 9.7 MINITAB Forward Stepwise Regression Output: Surgical Unit Example.

Alpha-to-Enter: 0.1   Alpha-to-Remove: 0.15

Response is lnSurviv on 8 predictors, with N = 54

Step              1        2        3        4
Constant      5.264    4.351    4.291    3.852

Enzyme       0.0151   0.0154   0.0145   0.0155
T-Value        6.23     8.19     9.33    11.07
P-Value       0.000    0.000    0.000    0.000

Proginde               0.0141   0.0149   0.0142
T-Value                  5.98     7.68     8.20
P-Value                 0.000    0.000    0.000

Histheav                         0.429    0.353
T-Value                           5.08     4.57
P-Value                          0.000    0.000

Bloodclo                                  0.073
T-Value                                    3.86
P-Value                                   0.000

S             0.375    0.291    0.238    0.211
R-Sq          42.76    66.33    77.80    82.99
R-Sq(adj)     41.66    65.01    76.47    81.60
C-p           117.4     50.5     18.9      5.8

2. At this stage, step 1 has been completed. The current regression model contains
Enzyme ($X_3$), and the printout displays, near the top of the column labeled “Step 1,” the
regression coefficient for Enzyme (0.0151), the $t^*$ value for this coefficient (6.23), and the
corresponding P-value (0.000). At the bottom of column 1, a number of variables-selection
criteria, including $R^2$ (42.76), $R^2_a$ (41.66), and $C_p$ (117.4) are also provided.

Next, all regression models containing $X_3$ and another X variable are fitted, and the $t^*$
statistics calculated. They are now:

$$t_k^* = \sqrt{\frac{MSR(X_k \mid X_3)}{MSE(X_3, X_k)}}$$

Progindex ($X_2$) has the highest $t^*$ value, and its P-value (0.000) falls below 0.10, so that
$X_2$ now enters the model.





3. The column labeled Step 2 in Figure 9.7 summarizes the situation at this point. Enzyme 
and Progindex (X 3 and X 2 ) are now In the model, and information about this model is 
provided. At this point, a test whether Enzyme (X 3 ) should be dropped is undertaken, but 
because the P-value (0.000) corresponding to X 3 is not above 0.15, this variable Is retained. 

4. Next, all regression models containing X 2 , X 3 , and one of the remaining potential X 
variables are fitted. The appropriate t* statistics now are: 

广二 jMSR{X k \X 2 , X^) 
k — ]j MSE(X 2 , X 3 ~X k ) 

The predictor labeled Histheavy (Xg) had the largest value ^P-value = 0.000) and 
was next added to the model. 

5. The column labeled Step 3 in Figure 9.7 summarizes the situation at this point. $X_2$,
$X_3$, and $X_8$ are now in the model. Next, a test is undertaken to determine whether $X_2$ or
$X_3$ should be dropped. Since both of the corresponding P-values are less than 0.15, neither
predictor is dropped from the model.

6. At step 4 Bloodclot ($X_1$) is added, and no terms previously included were dropped.
The right-most column of Figure 9.7 summarizes the addition of variable $X_1$ into the model
containing variables $X_2$, $X_3$, and $X_8$. Next, a test is undertaken to determine whether any of
$X_2$, $X_3$, or $X_8$ should be dropped. Since all P-values are less than 0.15 (all are 0.000), all
variables are retained.

7. Finally, the stepwise regression routine considers adding one of $X_4$, $X_5$, $X_6$, or $X_7$ to
the model containing $X_1$, $X_2$, $X_3$, and $X_8$. In each case, the P-values are greater than 0.10
(not shown); therefore, no additional variables can be added to the model and the search
process is terminated.

Thus, the stepwise search algorithm identifies ($X_1$, $X_2$, $X_3$, $X_8$) as the “best” subset of
X variables. This model also happens to be the model identified by both the $SBC_p$ and
$PRESS_p$ criteria in our previous analyses based on an assessment of “best” subset selection.
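
The search just described can be sketched in code. What follows is a minimal illustration under stated assumptions, not the routine that produced Figure 9.7: to stay self-contained it compares $|t_k^*|$ with fixed limits (the hypothetical parameters `t_enter` and `t_remove`) instead of converting to P-values, it solves the normal equations directly, and it runs on simulated data with made-up predictor names `x1`, `x2`, `x3`. A larger t limit corresponds to a smaller α, so requiring `t_enter >= t_remove` plays the role of keeping α-to-enter no larger than α-to-remove.

```python
import math
import random


def sse_of_fit(columns, y):
    """SSE from regressing y on an intercept plus the given predictor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    b = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2
               for i in range(n))


def t_star(k, others, data, y):
    """|t*| for X_k given `others`: sqrt(MSR(X_k | others) / MSE(others, X_k))."""
    sse_reduced = sse_of_fit([data[j] for j in others], y)
    sse_full = sse_of_fit([data[j] for j in others] + [data[k]], y)
    # Error degrees of freedom: n minus (intercept + others + X_k).
    mse_full = sse_full / (len(y) - len(others) - 2)
    return math.sqrt(max(sse_reduced - sse_full, 0.0) / mse_full)


def forward_stepwise(data, y, t_enter=2.0, t_remove=1.9):
    """Return the names of the predictors selected by forward stepwise search."""
    if t_enter < t_remove:
        raise ValueError("t_enter must not be below t_remove (cycling risk)")
    model = []
    for _ in range(2 * len(data)):  # generous cap on the number of steps
        candidates = [k for k in data if k not in model]
        if not candidates:
            break
        best = max(candidates, key=lambda k: t_star(k, model, data, y))
        if t_star(best, model, data, y) < t_enter:
            break  # nothing qualifies for entry, so the search terminates
        model.append(best)
        # Test whether a variable entered earlier should now be dropped.
        earlier = model[:-1]
        if earlier:
            def given_rest(k):
                return t_star(k, [j for j in model if j != k], data, y)
            weakest = min(earlier, key=given_rest)
            if given_rest(weakest) < t_remove:
                model.remove(weakest)
    return model


# Simulated illustration: y depends on x1 and x2 only; x3 is unrelated noise.
rng = random.Random(0)
n = 60
data = {name: [rng.uniform(0, 10) for _ in range(n)]
        for name in ("x1", "x2", "x3")}
y = [2 + 3 * a - 2 * b + rng.gauss(0, 1)
     for a, b in zip(data["x1"], data["x2"])]
selected = forward_stepwise(data, y)
print(selected)
```

In practice a statistical package would report P-values from the t distribution with the appropriate error degrees of freedom; the fixed limits here are only a stand-in for that step.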


Comments 

1. The choice of α-to-enter and α-to-remove values essentially represents a balancing of opposing
tendencies. Simulation studies have shown that for large pools of uncorrelated predictor variables that
have been generated to be uncorrelated with the response variable, use of large or moderately large
α-to-enter values as the entry criterion results in a procedure that is too liberal; that is, it allows
too many predictor variables into the model. On the other hand, models produced by an automatic
selection procedure with small α-to-enter values are often underspecified, resulting in $\sigma^2$ being badly
overestimated and the procedure being too conservative (see, for example, References 9.2 and 9.3).

2. The maximum acceptable α-to-enter value should never be larger than the minimum acceptable
α-to-remove value; otherwise, cycling is possible where a variable is continually entered and removed.

3. The order in which variables enter the regression model does not reflect their importance. At
times, a variable may enter the model, only to be dropped at a later stage because it can be predicted
well from the other predictors that have been subsequently added. ■

Other Stepwise Procedures 

Other stepwise procedures are available to find a “best” subset of predictor variables. We 
mention two of these. 





Forward Selection. The forward selection search procedure is a simplified version of 
forward stepwise regression, omitting the test whether a variable once entered into the 
model should be dropped. 

Backward Elimination. The backward elimination search procedure is the opposite of
forward selection. It begins with the model containing all potential X variables and identifies
the one with the largest P-value. If the maximum P-value is greater than a predetermined
limit, that X variable is dropped. The model with the remaining $P - 2$ X variables is
then fitted, and the next candidate for dropping is identified. This process continues until
no further X variables can be dropped. A stepwise modification can also be adapted that
allows variables eliminated earlier to be added later; this modification is called the backward
stepwise regression procedure.

Comment 

For small and moderate numbers of variables in the pool of potential X variables, some statisticians
argue for backward stepwise search over forward stepwise search (see Reference 9.4). A potential
disadvantage of the forward stepwise approach is that the MSE, and hence $s\{b_k\}$, will tend to be
inflated during the initial steps, because important predictors have been omitted. This in turn leads
to $t^*$ test statistics (9.18) that are too small. For the backward stepwise procedure, MSE values tend
to be more nearly unbiased because important predictors are retained at each step. An argument in
favor of the backward stepwise procedure can also be made in situations where it is useful as a first
step to look at each X variable in the regression function adjusted for all the other X variables in
the pool. ■

9.5 Some Final Comments on Automatic Model Selection Procedures

Our discussion of the major automatic selection procedures for identifying the “best” subset 
of X variables has focused on the main conceptual issues and not on options, variations, 
and refinements available with particular computer packages. It is essential that the specific 
features of the package employed be fully understood so that intelligent use of the package 
can be made. In some packages, there is an option for regression models through the origin. 
Some packages permit variables to be brought into the model and tested in pairs or other 
groupings instead of singly, to save computing time or for other reasons. Some packages, 
once a “best” regression model is identified, will fit all the possible regression models with 
the same number of variables and will develop information for each model so that a final 
choice can be made by the user. Some stepwise programs have options for forcing variables 
into the regression model; such variables are not removed even if their P-values become 
too large. 

The diversity of these options and special features serves to emphasize a point made 
earlier: there is no unique way of searching for “good” subsets of X variables, and subjective 
elements must play an important role in the search process. 

We have considered a number of important issues related to exploratory model building,
but there are many others. (A good discussion of many of these issues may be found in
Reference 9.5.) Most important for good model building is the recognition that no automatic search
procedure will always find the “best” model, and that, indeed, there may exist several “good”
regression models whose appropriateness for the purpose at hand needs to be investigated.





Judgment needs to play an important role in model building for exploratory studies. 
Some explanatory variables may be known to be more fundamental than others and
therefore should be retained in the regression model if the primary purpose is to develop a good
explanatory model. When a qualitative predictor variable is represented in the pool of
potential X variables by a number of indicator variables (e.g., geographic region is represented by
several indicator variables), it is often appropriate to keep these indicator variables together
as a group to represent the qualitative variable, even if a subset containing only some of
the indicator variables is “better” according to the criterion employed. Similarly, if
second-order terms $X_k^2$ or interaction terms $X_k X_{k'}$ need to be present in a regression model, one
would ordinarily wish to have the first-order terms in the model as representing the main
effects.

The selection of a subset regression model for exploratory observational studies has
been the subject of much recent research. Reference 9.5 provides information about many
of these studies. New methods of identifying the “best” subset have been proposed, including
methods based on deleting one case at a time and on bootstrapping. With the first method, the
criterion is evaluated for identified subsets n times, each time with one case omitted, in order
to select the “best” subset. With bootstrapping, repeated samples of cases are selected with
replacement from the data set (alternatively, repeated samples of residuals from the model
fitted to all X variables are selected with replacement to obtain observed Y values), and the
criterion is evaluated for identified subsets in order to select the “best” subset. Research
by Breiman and Spector (Ref. 9.7) has evaluated these methods from the standpoint of the
closeness of the selected model to the true model and has found the two methods promising,
the bootstrap method requiring larger data sets.

An important issue in exploratory model building that we have not yet considered is 
the bias in estimated regression coefficients and in estimated mean responses, as well as in 
their estimated standard deviations, that may result when the coefficients and error mean 
square for the finally selected regression model are estimated from the same data that were 
used for selecting the model. Sometimes, these biases may be substantial (see, for example,
References 9.5 and 9.6). In the next section, we will show how one can examine whether the
estimated regression coefficients and error mean square are biased to a substantial extent. 


9.6 Model Validation 


The final step in the model-building process is the validation of the selected regression 
models. Model validation usually involves checking a candidate model against independent 
data. Three basic ways of validating a regression model are:

1. Collection of new data to check the model and its predictive ability.

2. Comparison of results with theoretical expectations, earlier empirical results, and
simulation results.

3. Use of a holdout sample to check the model and its predictive ability.

When a regression model is used in a controlled experiment, a repetition of the experiment 
and its analysis serves to validate the findings in the initial study if similar results for the 
regression coefficients, predictive ability, and the like are obtained. Similarly, findings in 
confirmatory observational studies are validated by a repetition of the study with other data. 






As we noted in Section 9.1, there are generally no extensive problems in the selection of
predictor variables in controlled experiments and confirmatory observational studies. In
contrast, exploratory observational studies frequently involve large pools of explanatory
variables and the selection of a subset of these for the final regression model. For these
studies, validation of the regression model involves also the appropriateness of the variables
selected, as well as the magnitudes of the regression coefficients, the predictive ability of 
the model, and the like. Our discussion of validation will focus primarily on issues that arise 
in validating regression models for exploratory observational studies. A good discussion 
of the need for replicating any study to establish the generalizability of the findings may 
be found in Reference 9.8. References 9.9 and 9.10 provide helpful presentations of issues 
arising in the validation of regression models. 

Collection of New Data to Check Model 

The best means of model validation is through the collection of new data. The purpose 
of collecting new data is to be able to examine whether the regression model developed 
from the earlier data is still applicable for the new data. If so, one has assurance about the 
applicability of the model to data beyond those on which the model is based. 

Methods of Checking Validity. There are a variety of methods of examining the validity 
of the regression model against the new data. One validation method is to reestimate the 
model form chosen earlier using the new data. The estimated regression coefficients and 
various characteristics of the fitted model are then compared for consistency to those of the 
regression model based on the earlier data. If the results are consistent, they provide strong 
support that the chosen regression model is applicable under broader circumstances than 
those related to the original data. 

A second validation method is designed to calibrate the predictive capability of the 
selected regression model. When a regression model is developed from given data, it is 
inevitable that the selected model is chosen, at least in large part, because it fits well the 
data at hand. For a different set of random outcomes, one may likely have arrived at a 
different model in terms of the predictor variables selected and/or their functional forms 
and interaction terms present in the model. A result of this model development process is 
that the error mean square MSE will tend to understate the inherent variability in making 
future predictions from the selected model. 

A means of measuring the actual predictive capability of the selected regression model
is to use this model to predict each case in the new data set and then to calculate the mean
of the squared prediction errors, to be denoted by MSPR, which stands for mean squared
prediction error:

$$MSPR = \frac{\sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2}{n^*} \qquad (9.20)$$

where:

$Y_i$ is the value of the response variable in the ith validation case
$\hat{Y}_i$ is the predicted value for the ith validation case based on the model-building data set
$n^*$ is the number of cases in the validation data set
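
As a small illustration of (9.20), the calculation can be written as a short function. This is a sketch only: the function name `mspr` and the three observed and predicted values are invented for the example and do not come from any data set in the text.

```python
def mspr(y_validation, y_predicted):
    """Mean squared prediction error (9.20) over the n* validation cases.

    y_predicted must hold predictions made from the model fitted to the
    model-building data set, not from a refit to the validation data.
    """
    if not y_validation or len(y_validation) != len(y_predicted):
        raise ValueError("need two non-empty sequences of equal length")
    n_star = len(y_validation)
    return sum((y - y_hat) ** 2
               for y, y_hat in zip(y_validation, y_predicted)) / n_star


# Hypothetical validation responses and model-based predictions.
observed = [3.0, 5.0, 4.0]
predicted = [2.5, 5.5, 4.0]
print(mspr(observed, predicted))
```

Note that the divisor is $n^*$ rather than an error degrees of freedom, because no parameters are estimated from the validation cases.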





If the mean squared prediction error MSPR is fairly close to MSE based on the regression
fit to the model-building data set, then the error mean square MSE for the selected regression
model is not seriously biased and gives an appropriate indication of the predictive ability of
the model. If the mean squared prediction error is much larger than MSE, one should rely
on the mean squared prediction error as an indicator of how well the selected regression
model will predict in the future.

Difficulties in Replicating a Study. Difficulties often arise when new data are collected 
to validate a regression model, especially with observational studies. Even with controlled 
experiments, however, there may be difficulties in replicating an earlier study in identical 
fashion. For instance, the laboratory equipment for the new study to be conducted in a 
different laboratory may differ from that used in the initial study, resulting in somewhat 
different calibrations for the response measurements. 

The difficulties in replicating a study are particularly acute in the social sciences where 
controlled experiments often are not feasible. Repetition of an observational study usually 
involves different conditions, the differences being related to changes in setting and/or time.
For instance, a study investigating the relation between the amount of delegation of authority
by executives in a firm and the age of the executive was repeated in another firm that
has a somewhat different management philosophy. As another example, a study relating
consumer purchases of a product to special promotional incentives was repeated in another 
year when the business climate differed substantially from that during the initial study. 

It may be thought that an inability to reproduce a study identically makes the replication
study useless for validation purposes. This is not the case. No single study is fully useful 
until we know how much the results of the study can be generalized. If a replication study for 
which the conditions of the setting differ only slightly from those of the initial study yields 
substantially different regression results, then we learn that the results of the initial study 
cannot be readily generalized. On the other hand, if the conditions differ substantially and 
the regression results are still similar, we find that the regression results can be generalized to 
apply under substantially varying conditions. Still another possibility is that the regression 
results for the replication study differ substantially from those of the initial study, the 
differences being related to changes in the setting. This information may be useful for 
enriching the regression model by including new explanatory variables that make the model 
more widely applicable. 


Comment 

When the new data are collected under controlled conditions in an experiment, it is desirable to include 
data points of major interest to check out the model predictions. If the model is to be used for making 
predictions over the entire range of the X observations, a possibility is to include data points that are 
uniformly distributed over the X space. ■

Comparison with Theory, Empirical Evidence, or Simulation Results

In some cases, theory, simulation results, or previous empirical results may be helpful in
determining whether the selected model is reasonable. Comparisons of regression
coefficients and predictions with theoretical expectations, previous empirical results, or simulation
results should be made. Unfortunately, there is often little theory that can be used to validate
regression models.

Data Splitting 

By far the preferred method to validate a regression model is through the collection of new
data. Often, however, this is neither practical nor feasible. An alternative when the data set
is large enough is to split the data into two sets. The first set, called the model-building set or
the training sample, is used to develop the model. The second data set, called the validation
or prediction set, is used to evaluate the reasonableness and predictive ability of the selected
model. This validation procedure is often called cross-validation. Data splitting in effect is
an attempt to simulate replication of the study.

The validation data set is used for validation in the same way as when new data are
collected. The regression coefficients can be reestimated for the selected model and then
compared for consistency with the coefficients obtained from the model-building data set.
Also, predictions can be made for the data in the validation data set from the regression
model developed from the model-building data set to calibrate the predictive ability of this
regression model for the new data. When the calibration data set is large enough, one can
also study how the “good” models considered in the model selection phase fare with the
new data.

Data sets are often split equally into model-building and validation data sets. It is
important, however, that the model-building data set be sufficiently large so that a reliable model
can be developed. Recall in this connection that the number of cases should be at least 6 to
10 times the number of variables in the pool of predictor variables. Thus, when 10 variables
are in the pool, the model-building data set should contain at least 60 to 100 cases. If the
entire data set is not large enough under these circumstances for making an equal split, the
validation data set will need to be smaller than the model-building data set.

Splits of the data can be made at random. Another possibility is to match cases in pairs 
and place one of each pair into one of the two split data sets. When data are collected 
sequentially in time, it is often useful to pick a point in time to divide the data. Generally, 
the earlier data are selected for the model-building set and the later data for the validation 
set. When seasonal or cyclical effects are present in the data (e.g., sales data), the split
should be made at a point where the cycles are balanced. 
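
The splitting rules in the preceding paragraphs can be sketched as follows. This is an illustration under assumptions, not a prescribed procedure: the function names, the default of 6 cases per pool variable, and the use of case indices in time order are all choices made for the example.

```python
import random


def minimum_build_cases(n_pool_variables, cases_per_variable=6):
    """Smallest model-building set under the 6-to-10-cases-per-variable guideline."""
    return cases_per_variable * n_pool_variables


def random_split(n_cases, n_build, seed=0):
    """Randomly assign case indices to model-building and validation sets."""
    if not 0 < n_build < n_cases:
        raise ValueError("n_build must lie strictly between 0 and n_cases")
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_build]), sorted(indices[n_build:])


def time_split(n_cases, n_build):
    """Earlier cases build the model, later cases validate it.

    Assumes the cases are already indexed in time order.
    """
    if not 0 < n_build < n_cases:
        raise ValueError("n_build must lie strictly between 0 and n_cases")
    return list(range(n_build)), list(range(n_build, n_cases))


# Hypothetical study: 108 cases and a pool of 8 predictor variables.
n_cases, n_pool = 108, 8
n_build = max(n_cases // 2, minimum_build_cases(n_pool))  # equal split if allowed
build, validation = random_split(n_cases, n_build)
print(len(build), len(validation))
```

When the guideline demands more than half of the cases, `n_build` exceeds `n_cases // 2` and the validation set is correspondingly smaller, as the text describes.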

Use of time or some other characteristic of the data to split the data set provides the
opportunity to test the generalizability of the model since conditions may differ for the two
data sets. Data in the validation set may have been created under different causal conditions
than those of the model-building set. In some cases, data in the validation set may represent
extrapolations with respect to the data in the model-building set (e.g., sales data collected
over time may contain a strong trend component). Such differential conditions may lead to
a lack of validity of the model based on the model-building data set and indicate a need to
broaden the regression model so that it is applicable under a broader scope of conditions.

A possible drawback of data splitting is that the variances of the estimated regression
coefficients developed from the model-building data set will usually be larger than those
that would have been obtained from the fit to the entire data set. If the model-building data
set is reasonably large, however, these variances generally will not be that much larger than
those for the entire data set. In any case, once the model has been validated, it is customary
practice to use the entire data set for estimating the final regression model.





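In the surgical unit example, three models were favored by the various model-selection
criteria. The $SBC_p$ and $PRESS_p$ criteria favored the four-predictor model:

$$Y_i' = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_8 X_{i8} + \varepsilon_i \qquad \text{Model 1} \quad (9.21)$$

$C_p$ was minimized by the five-predictor model:

$$Y_i' = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_6 X_{i6} + \beta_8 X_{i8} + \varepsilon_i \qquad \text{Model 2} \quad (9.22)$$

while the $R_{a,p}^2$ and $AIC_p$ criteria were optimized by the six-predictor model:

$$Y_i' = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_5 X_{i5} + \beta_6 X_{i6} + \beta_8 X_{i8} + \varepsilon_i \qquad \text{Model 3} \quad (9.23)$$

We wish to assess the validity of these three models, both internally and externally.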

Some evidence of the internal validity of these fitted models can be obtained through
an examination of the various model-selection criteria. Table 9.4 summarizes the fits of
the three candidate models to the original (training) data set in columns (1), (3), and (5).
We first consider the $SSE_p$, $PRESS_p$, and $C_p$ criterion values. Recall that the $PRESS_p$ value
is always larger than $SSE_p$ because the regression fit for the ith case when this case is
deleted in fitting can never be as good as that when the ith case is included. A $PRESS_p$
value reasonably close to $SSE_p$ supports the validity of the fitted regression model and of
$MSE_p$ as an indicator of the predictive capability of this model. In this case, all three of the
candidate models have $PRESS_p$ values that are reasonably close to $SSE_p$. For example, for
Model 1, $PRESS_p = 2.7378$ and $SSE_p = 2.1788$. Recall also that if $C_p \approx p$, this suggests that
there is little or no bias in the regression model. This is the case for the three models under
consideration. The $C_5$, $C_6$, and $C_7$ values for the three models are, respectively, 5.7508,
5.5406, and 5.7874.

TABLE 9.4 Regression Results for Candidate Models (9.21), (9.22), and (9.23) Based on Model-Building and
Validation Data Sets: Surgical Unit Example.

                    (1)          (2)          (3)          (4)          (5)          (6)
                  Model 1      Model 1      Model 2      Model 2      Model 3      Model 3
                  Training     Validation   Training     Validation   Training     Validation
Statistic         Data Set     Data Set     Data Set     Data Set     Data Set     Data Set

p                  5            5            6            6            7            7
b_0                3.8524       3.6350       3.8671       3.6143       4.0540       3.4699
s{b_0}             0.1927       0.2894       0.1906       0.2907       0.2348       0.3468
b_1                0.0733       0.0958       0.0712       0.0999       0.0715       0.0987
s{b_1}             0.0190       0.0319       0.0188       0.0323       0.0186       0.0325
b_2                0.0142       0.0164       0.0139       0.0159       0.0138       0.0162
s{b_2}             0.0017       0.0023       0.0017       0.0024       0.0017       0.0024
b_3                0.0155       0.0156       0.0151       0.0154       0.0151       0.0156
s{b_3}             0.0014       0.0020       0.0014       0.0020       0.0014       0.0021
b_5                -            -            -            -           -0.0035       0.0025
s{b_5}             -            -            -            -            0.0026       0.0033
b_6                -            -            0.0869       0.0731       0.0873       0.0727
s{b_6}             -            -            0.0582       0.0792       0.0577       0.0795
b_8                0.3530       0.1860       0.3627       0.1886       0.3509       0.1931
s{b_8}             0.0772       0.0964       0.0765       0.0966       0.0764       0.0972
SSE_p              2.1788       3.7951       2.0820       3.7288       2.0052       3.6822
PRESS_p            2.7378       4.3219       2.7827       4.6536       2.7723       4.8981
C_p                5.7508       6.2094       5.5406       7.3331       5.7874       8.7166
MSE_p              0.0445       0.0775       0.0434       0.0777       0.0427       0.0783
MSPR               0.0773       -            0.0764       -            0.0794       -
R^2_{a,p}          0.8160       0.6824       0.8205       0.6815       0.8234       0.6787

To validate the selected regression model externally, 54 additional cases had been held
out for a validation data set. A portion of the data for these cases is shown in Table 9.5. The
correlation matrix for these new data (not shown) is quite similar to the one in Figure 9.3 for
the model-building data set. The estimated regression coefficients, their estimated standard
deviations, and various model-selection criteria when regression models (9.21), (9.22), and
(9.23) are fitted to the validation data set are shown in Table 9.4, columns (2), (4), and (6).
Note the excellent agreement between the two sets of estimated regression coefficients, and
the two sets of regression coefficient standard errors. For example, for Model 1 fit to the
training data, $b_1 = .0733$; when fit to the validation data, we obtain $b_1 = .0958$. In view
of the magnitude of the corresponding standard errors (.0190 and .0319), these values are
reasonably close.

TABLE 9.5 Potential Predictor Variables and Response Variable: Surgical Unit Example.

          Blood-
Case      Clotting   Prognostic   Enzyme   Liver                    Survival
Number    Score      Index        Test     Test     Age    Gender   Time
  i       X_i1       X_i2         X_i3     X_i4     X_i5   X_i6     Y_i      Y'_i = ln Y_i

  55      7.1        23           78       1.93     45     0        302      5.710
  56      4.9        66           91       3.05     34     1        767      6.642
  57      6.4        90           35       1.06     39     1        487      6.188
 ...
 106      6.9        90           33       2.78     48     1        655      6.485
 107      7.9        45           55       2.46     43     0        377      5.932
 108      4.5        68           60       2.07     59     0        642      6.465

(The two alcohol-use indicator columns, X_i7 and X_i8, are not recoverable here.)

A review of Table 9.4 shows that most of the estimated coefficients agree quite closely.
However, it is noteworthy that in Model 3, $b_5$, the coefficient of age, is negative for the
training data ($b_5 = -0.0035$) and positive for the validation data ($b_5 = 0.0025$). This is
certainly a cause for concern, and it raises doubts about the validity of Model 3.

To calibrate the predictive ability of the regression models fitted from the training data
set, the mean squared prediction errors MSPR in (9.20) were calculated for the 54 cases in
the validation data set in Table 9.5 for each of the three candidate models; they are .0773,
.0764, and .0794, respectively. The mean squared prediction error generally will be larger
than $MSE_p$ based on the training data set because entirely new data are involved in the
validation data set. In this case, the relevant $MSE_p$ values for the three models are .0445,
.0434, and .0427. The fact that MSPR here does not differ too greatly from $MSE_p$ implies
that the error mean square $MSE_p$ based on the training data set is a reasonably valid indicator
of the predictive ability of the fitted regression model. The closeness of the three MSPR
values suggests that the three candidate models perform comparably in terms of predictive
accuracy.

As a consequence of the concerns noted earlier about Model 3, this model was eliminated
from further consideration. The final selection was based on the principle of parsimony.
While Models 1 and 2 performed comparably in the validation study, Model 1 achieves this
level of performance with one fewer parameter. For this reason, Model 1 was ultimately
chosen by the investigator as the final model.
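
The comparisons made in this example can be laid side by side with a few lines of code. The sketch below only re-expresses numbers already reported for the training-data fits and the MSPR values; the dictionary layout and the function name `validation_ratios` are illustrative choices, and no cutoff for "reasonably close" is implied.

```python
# Training-data criteria and MSPR values as reported for the three models.
models = {
    "Model 1": {"p": 5, "SSE": 2.1788, "PRESS": 2.7378, "Cp": 5.7508,
                "MSE": 0.0445, "MSPR": 0.0773},
    "Model 2": {"p": 6, "SSE": 2.0820, "PRESS": 2.7827, "Cp": 5.5406,
                "MSE": 0.0434, "MSPR": 0.0764},
    "Model 3": {"p": 7, "SSE": 2.0052, "PRESS": 2.7723, "Cp": 5.7874,
                "MSE": 0.0427, "MSPR": 0.0794},
}


def validation_ratios(stats):
    """Internal (PRESS vs. SSE, C_p vs. p) and external (MSPR vs. MSE) checks."""
    return {
        "PRESS/SSE": stats["PRESS"] / stats["SSE"],
        "Cp - p": stats["Cp"] - stats["p"],
        "MSPR/MSE": stats["MSPR"] / stats["MSE"],
    }


summary = {name: validation_ratios(stats) for name, stats in models.items()}
for name, ratios in summary.items():
    print(name, {key: round(value, 3) for key, value in ratios.items()})
```

Laying the ratios out this way makes it easy to see how the gap between training-based and validation-based measures changes as predictors are added.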

Comments 

1. Algorithms are available to split data so that the two data sets have similar statistical properties.
The reader is referred to Reference 9.11 for a discussion of this and other issues associated with
validation of regression models.

2. Refinements of data splitting have been proposed. With the double cross-validation procedure,
for example, the model is built for each half of the split data and then tested on the other half of
the data. Thus, two measures of consistency and predictive ability are obtained from the two fitted
models. For smaller data sets, a procedure called K-fold cross-validation is often used. With this
procedure, the data are first split into K roughly equal parts. For $k = 1, 2, \ldots, K$, we use the kth part
as the validation set, fit the model using the other $K - 1$ parts, and obtain the predicted sum of squares
for error. The K estimates of prediction error are then combined to produce a K-fold cross-validation
estimate. Note that when $K = n$, the K-fold cross-validation estimate is identical to the $PRESS_p$
statistic.

3. For small data sets where data splitting is impractical, the PRESS criterion in (9.17), considered
earlier for use in subset selection, can be employed as a form of data splitting to assess the precision
of model predictions. Recall that with this procedure, each data point is predicted from the least
squares fitted regression function developed from the remaining $n - 1$ data points. A fairly close
agreement between PRESS and SSE suggests that MSE may be a reasonably valid indicator of the
selected model’s predictive capability. Variations of PRESS for validation have also been proposed,
whereby m cases are held out for validation and the remaining $n - m$ cases are used to fit the
model. Reference 9.11 discusses these procedures, as well as issues dealing with optimal splitting of
data sets.

4. When regression models built on observational data do not predict well outside the range of
the X observations in the data set, the usual reason is the existence of multicollinearity among the
X variables. Chapter 11 introduces possible solutions for this difficulty including ridge regression or
other biased estimation techniques.

5. If a data set for an exploratory observational study is very large, it can be divided into three parts.
The first part is used for model training, the second part for cross-validation and model selection, and
the third part for testing and calibrating the final model (Reference 9.10). This approach avoids any
bias resulting from estimating the regression parameters from the same data set used for developing
the model. A disadvantage of this procedure is that the parameter estimates are derived from a smaller
data set and hence are more imprecise than if the original data set were divided into two parts for
model building and validation. Consequently, the division of a data set into three parts is used in
practice only when the available data set is very large.
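
The K-fold procedure in Comment 2, and its link to PRESS when $K = n$, can be illustrated numerically. The sketch below makes several simplifying assumptions: it uses simple linear regression with one predictor so that it needs no matrix code, it assigns cases to parts systematically (every Kth case) instead of at random, it combines the K parts by summing their squared prediction errors, and the eight data points are invented. PRESS is computed from the full fit through the deleted residuals $e_i / (1 - h_{ii})$.

```python
def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def k_fold_prediction_ss(x, y, K):
    """Sum of squared prediction errors over the K held-out parts."""
    n = len(x)
    total = 0.0
    for k in range(K):
        held_out = set(range(k, n, K))  # part k: every K-th case starting at k
        x_fit = [x[i] for i in range(n) if i not in held_out]
        y_fit = [y[i] for i in range(n) if i not in held_out]
        b0, b1 = fit_line(x_fit, y_fit)
        total += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out)
    return total


def press(x, y):
    """PRESS from the full fit, using deleted residuals e_i / (1 - h_ii)."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        h_ii = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of case i
        total += ((yi - b0 - b1 * xi) / (1 - h_ii)) ** 2
    return total


# Invented data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]
b0, b1 = fit_line(x, y)
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
print(sse, press(x, y), k_fold_prediction_ss(x, y, 4), k_fold_prediction_ss(x, y, len(x)))
```

With `K = len(x)` each part holds a single case, so the K-fold sum reproduces PRESS up to rounding error, and both exceed SSE.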


Cited References

9.1. Daniel, C., and F. S. Wood. Fitting Equations to Data: Computer Analysis of Multifactor Data,
2nd ed. New York: John Wiley & Sons, 1999.

9.2. Freedman, D. A. “A Note on Screening Regression Equations,” The American Statistician 37
(1983), pp. 152-55.

9.3. Pope, P. T., and J. T. Webster. “The Use of an F-Statistic in Stepwise Regression,” Technometrics
14 (1972), pp. 327-40.

9.4. Mantel, N. “Why Stepdown Procedures in Variable Selection,” Technometrics 12 (1970),
pp. 621-25.

9.5. Miller, A. J. Subset Selection in Regression. 2nd ed. London: Chapman and Hall, 2002.

9.6. Faraway, J. J. “On the Cost of Data Analysis,” Journal of Computational and Graphical Statistics
1 (1992), pp. 213-29.

9.7. Breiman, L., and P. Spector. “Submodel Selection and Evaluation in Regression. The X-Random
Case,” International Statistical Review 60 (1992), pp. 291-319.

9.8. Lindsay, R. M., and A. S. C. Ehrenberg. “The Design of Replicated Studies,” The American
Statistician 47 (1993), pp. 217-28.

9.9. Snee, R. D. “Validation of Regression Models: Methods and Examples,” Technometrics 19
(1977), pp. 415-28.

9.10. Hastie, T., Tibshirani, R., and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer-Verlag, 2001.

9.11. Stone, M. “Cross-validatory Choice and Assessment of Statistical Prediction,” Journal of the
Royal Statistical Society B 36 (1974), pp. 111-47.


Problems 


9.1. A speaker stated: “In well-designed experiments involving quantitative explanatory variables 
a procedtu e for reducing the number of explamuoiy variables after the data are obtained is not 
necessary: Discuss. 

9.2. The dean of a graduate school wishes to predict the grade point average in graduate work for 
recent applicants. List a dozen variables that might be useful explanatory variables here. 

9.3. Two researchers, investigating factors affecting summer attendance at privately operated
beaches on Lake Ontario, collected information on attendance and 11 explanatory variables for
42 beaches. Two summers were studied, of relatively hot and relatively cool weather, respectively.
A "best" subsets algorithm now is to be used to reduce the number of explanatory
variables for the final regression model.

a. Should the variables reduction be done for both summers combined, or should it be done 
separately for each summer? Explain the problems involved and how you might handle 
them. 

b. Will the "best" subsets selection procedure choose those explanatory variables that are most
important in a causal sense for determining beach attendance?

9.4. In forward stepwise regression, what advantage is there in using a relatively small α-to-enter
value for adding variables? What advantage is there in using a larger α-to-enter value?

9.5. In forward stepwise regression, why should the α-to-enter value for adding variables never
exceed the α-to-remove value for deleting variables?

9.6. Prepare a flowchart of each of the following selection methods: (1) forward stepwise regression,
(2) forward selection, (3) backward elimination.

9.7. An engineer has stated: "Reduction of the number of explanatory variables should always be
done using the objective forward stepwise regression procedure." Discuss.

9.8. An attendee at a regression modeling short course stated: "I rarely see validation of regression
models mentioned in published papers, so it must really not be an important component of
model building." Comment.

*9.9. Refer to Patient satisfaction Problem 6.15. The hospital administrator wishes to determine the
best subset of predictor variables for predicting patient satisfaction.

a. Indicate which subset of predictor variables you would recommend as best for predicting
patient satisfaction according to each of the following criteria: (1) $R^2_{a,p}$, (2) $AIC_p$, (3) $C_p$,
(4) $PRESS_p$. Support your recommendations with appropriate graphs.

b. Do the four criteria in part (a) identify the same best subset? Does this always happen?

c. Would forward stepwise regression have any advantages here as a screening procedure over
the all-possible-regressions procedure?

*9.10. Job proficiency. A personnel officer in a governmental agency administered four newly
developed aptitude tests to each of 25 applicants for entry-level clerical positions in the agency.
For purpose of the study, all 25 applicants were accepted for positions irrespective of their test
scores. After a probationary period, each applicant was rated for proficiency on the job. The
scores on the four tests ($X_1$, $X_2$, $X_3$, $X_4$) and the job proficiency score ($Y$) for the 25 employees
were as follows:


                  Test Score              Job Proficiency
Subject                                        Score
   i      X_i1    X_i2    X_i3    X_i4          Y_i

   1        86     110     100      87           88
   2        62      97      99     100           80
   3       110     107     103     103           96
  ...      ...     ...     ...     ...          ...
  23       104      73      93      80           78
  24        94     121     115     104          115
  25        91     129      97      83           83

a. Prepare separate stem-and-leaf plots of the test scores for each of the four newly developed
aptitude tests. Are there any noteworthy features in these plots? Comment.

b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do
the scatter plots suggest about the nature of the functional relationship between the response
variable Y and each of the predictor variables? Are any serious multicollinearity problems 
evident? Explain. 

c. Fit the multiple regression function containing all four predictor variables as first-order 
terms. Does it appear that all predictor variables should be retained? 

*9.11. Refer to Job proficiency Problem 9.10. 

a. Using only first-order terms for the predictor variables in the pool of potential X variables,
find the four best subset regression models according to the $R^2_{a,p}$ criterion.

b. Since there is relatively little difference in $R^2_{a,p}$ for the four best subset models, what other
criteria would you use to help in the selection of the best model? Discuss.

9.12. Refer to Market share data set in Appendix C.3 and Problem 8.42. 

a. Using only first-order terms for predictor variables, find the three best subset regression
models according to the $SBC_p$ criterion.

b. Is your finding here in agreement with what you found in Problem 8.42 (b) and (c)?

9.13. Lung pressure. Increased arterial blood pressure in the lungs frequently leads to the
development of heart failure in patients with chronic obstructive pulmonary disease (COPD). The
standard method for determining arterial lung pressure is invasive, technically difficult, and
involves some risk to the patient. Radionuclide imaging is a noninvasive, less risky method for
estimating arterial pressure in the lungs. To investigate the predictive ability of this method, a
cardiologist collected data on 19 mild-to-moderate COPD patients. The data that follow
include the invasive measure of systolic pulmonary arterial pressure ($Y$) and three
potential noninvasive predictor variables. Two were obtained by using radionuclide imaging:
emptying rate of blood into the pumping chamber of the heart ($X_1$) and ejection rate of blood
pumped out of the heart into the lungs ($X_2$). The third predictor variable measures a blood
gas ($X_3$).

a. Prepare separate dot plots for each of the three predictor variables. Are there any noteworthy
features in these plots? Comment.

b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do
the scatter plots suggest about the nature of the functional relationship between Y and each
of the predictor variables? Are any serious multicollinearity problems evident? Explain.

c. Fit the multiple regression function containing the three predictor variables as first-order
terms. Does it appear that all predictor variables should be retained?

Subject
   i      X_i1    X_i2    X_i3     Y_i

   1        45      36      45      49
   2        30      28      40      55
   3        11      16      42      85
  ...      ...     ...     ...     ...
  17        27      51      44      29
  18        37      32      54      40
  19        34      40      36      31

Adapted from A. T. Marmor et al., "Improved Radionuclide Method for Assessment of Pulmonary
Artery Pressure in COPD," Chest 89 (1986), pp. 64-69.

9.14. Refer to Lung pressure Problem 9.13. 

a. Using first-order and second-order terms for each of the three predictor variables (centered
around the mean) in the pool of potential X variables (including cross products of the first-order
terms), find the three best hierarchical subset regression models according to the $R^2_{a,p}$
criterion.

b. Is there much difference in $R^2_{a,p}$ for the three best subset models?

9.15. Kidney function. Creatinine clearance ($Y$) is an important measure of kidney function, but is
difficult to obtain in a clinical office setting because it requires 24-hour urine collection. To
determine whether this measure can be predicted from some data that are easily available, a
kidney specialist obtained the data that follow for 33 male subjects. The predictor variables are
serum creatinine concentration ($X_1$), age ($X_2$), and weight ($X_3$).


Subject
   i      X_i1    X_i2    X_i3     Y_i

   1       .71      38      71     132
   2      1.48      78      69      53
   3      2.21      69      85      50
  ...      ...     ...     ...     ...
  31      1.53      70      75      52
  32      1.58      63      62      73
  33      1.37      68      52      57

Adapted from W. J. Shih and S. Weisberg, "Assessing Influence in Multiple Linear Regression
with Incomplete Data," Technometrics 28 (1986), pp. 231-40.



a. Prepare separate dot plots for each of the three predictor variables. Are there any noteworthy 
features in these plots? Comment. 

b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do 
the scatter plots suggest about the nature of the functional relationship between the response 
variable Y and each predictor variable? Discuss. Are any serious multicollinearity problems 
evident? Explain.

c. Fit the multiple regression function containing the three predictor variables as first-order
terms. Does it appear that all predictor variables should be retained?

9.16. Refer to Kidney function Problem 9.15.

a. Using first-order and second-order terms for each of the three predictor variables (centered
around the mean) in the pool of potential X variables (including cross products of the first-order
terms), find the three best hierarchical subset regression models according to the $C_p$
criterion.

b. Is there much difference in $C_p$ for the three best subset models?

*9.17. Refer to Patient satisfaction Problems 6.15 and 9.9. The hospital administrator was interested
to learn how the forward stepwise selection procedure and some of its variations would perform
here.

a. Determine the subset of variables that is selected as best by the forward stepwise regression 
procedure, using F limits of 3.0 and 2.9 to add or delete a variable, respectively. Show your
steps.

b. To what level of significance in any individual test is the F limit of 3.0 for adding a variable
approximately equivalent here?

c. Determine the subset of variables that is selected as best by the forward selection procedure,
using an F limit of 3.0 to add a variable. Show your steps.

d. Determine the subset of variables that is selected as best by the backward elimination
procedure, using an F limit of 2.9 to delete a variable. Show your steps.

e. Compare the results of the three selection procedures. How consistent are these results? 
How do the results compare with those for all possible regressions in Problem 9.9? 

*9.18. Refer to Job proficiency Problems 9.10 and 9.11.

a. Using forward stepwise regression, find the best subset of predictor variables to predict job
proficiency. Use α limits of .05 and .10 for adding or deleting a variable, respectively.

b. How does the best subset according to forward stepwise regression compare with the best
subset according to the $R^2_{a,p}$ criterion obtained in Problem 9.11a?

9.19. Refer to Kidney function Problems 9.15 and 9.16.

a. Using the same pool of potential X variables as in Problem 9.16a, find the best subset of
variables according to forward stepwise regression with α limits of .10 and .15 to add or
delete a variable, respectively.

b. How does the best subset according to forward stepwise regression compare with the best
subset according to the $R^2_{a,p}$ criterion obtained in Problem 9.16a?

9.20. Refer to Market share data set in Appendix C.3 and Problems 8.42 and 9.12.

a. Using forward stepwise regression, find the best subset of predictor variables to predict
market share of their product. Use α limits of .10 and .15 for adding or deleting a predictor,
respectively.

b. How does the best subset according to forward stepwise regression compare with the best
subset according to the $SBC_p$ criterion used in 9.12a?





Exercise 


*9.21. Refer to Job proficiency Problems 9.10 and 9.18. To assess internally the predictive ability of
the regression model identified in Problem 9.18, compute the PRESS statistic and compare it
to SSE. What does this comparison suggest about the validity of MSE as an indicator of the
predictive ability of the fitted model?


*9.22. Refer to Job proficiency Problems 9.10 and 9.18. To assess externally the validity of the
regression model identified in Problem 9.18, 25 additional applicants for entry-level clerical
positions in the agency were similarly tested and hired irrespective of their test scores. The data
follow.


                  Test Score              Job Proficiency
Subject                                        Score
   i      X_i1    X_i2    X_i3    X_i4          Y_i

  26        65     109      88      84           58
  27        85      90     104      98           92
  28        93      73      91      82           71
  ...      ...     ...     ...     ...          ...
  48       115     119     102      94           95
  49       129      70      94      95           81
  50       136     104     106     104          109


a. Obtain the correlation matrix of the X variables for the validation data set and compare it
with that obtained in Problem 9.10b for the model-building data set. Are the two correlation
matrices reasonably similar?

b. Fit the regression model identified in Problem 9.18a to the validation data set. Compare the
estimated regression coefficients and their estimated standard deviations to those obtained
in Problem 9.18a. Also compare the error mean squares and coefficients of multiple
determination. Do the estimates for the validation data set appear to be reasonably similar to
those obtained for the model-building data set?

c. Calculate the mean squared prediction error in (9.20) and compare it to MSE obtained from
the model-building data set. Is there evidence of a substantial bias problem in MSE here? Is
this conclusion consistent with your finding in Problem 9.21? Discuss.

d. Combine the model-building data set in Problem 9.10 with the validation data set and fit the
selected regression model to the combined data. Are the estimated standard deviations of
the estimated regression coefficients appreciably reduced now from those obtained for the
model-building data set?

9.23. Refer to Lung pressure Problems 9.13 and 9.14. The validity of the regression model identified
as best in Problem 9.14a is to be assessed internally.

a. Calculate the PRESS statistic and compare it to SSE. What does this comparison suggest
about the validity of MSE as an indicator of the predictive ability of the fitted model?

b. Case 8 alone accounts for approximately one-half of the entire PRESS statistic. Would you
recommend modification of the model because of the strong impact of this case? What are
some corrective action options that would lessen the effect of case 8? Discuss.


9.24. The true quadratic regression function is $E\{Y\} = 15 + 20X + 3X^2$. The fitted linear regression
function is $\hat{Y} = 13 + 40X$, for which $E\{b_0\} = 10$ and $E\{b_1\} = 45$. What are the bias and sampling
error components of the mean squared error for $X_i = 10$ and for $X_i = 20$?



Projects


9.25. Refer to the SENIC data set in Appendix C.1. Length of stay ($Y$) is to be predicted, and the
pool of potential predictor variables includes all other variables in the data set except medical
school affiliation and region. It is believed that a model with $\log_{10} Y$ as the response variable
and the predictor variables in first-order terms with no interaction terms will be appropriate. 
Consider cases 57-113 to constitute the model-building data set to be used for the following 
analyses. 

a. Prepare separate dot plots for each of the predictor variables. Are there any noteworthy 
features in these plots? Comment. 

b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. Is there 
evidence of strong linear pairwise associations among the predictor variables here? 

c. Obtain the three best subsets according to the $C_p$ criterion. Which of these subset models
appears to have the smallest bias?

9.26. Refer to the CDI data set in Appendix C.2. A public safety official wishes to predict the rate of
serious crimes in a CDI ($Y$, total number of serious crimes per 100,000 population). The pool
of potential predictor variables includes all other variables in the data set except total population,
total serious crimes, county, state, and region. It is believed that a model with predictor variables 
in first-order terms with no interaction terms will be appropriate. Consider the even-numbered 
cases to constitute the model-building data set to be used for the following analyses. 

a. Prepare separate stem-and-leaf plots for each of the predictor variables. Are there any 
noteworthy features in these plots? Comment. 

b. Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. Is there 
evidence of strong linear pairwise associations among the predictor variables here? 

c. Using the $SBC_p$ criterion, obtain the three best subsets.

9.27. Refer to the SENIC data set in Appendix C.1 and Project 9.25. The regression model identified
as best in Project 9.25 is to be validated by means of the validation data set consisting of
cases 1-56.

a. Fit the regression model identified in Project 9.25 as best to the validation data set. Compare
the estimated regression coefficients and their estimated standard deviations with those
obtained in Project 9.25. Also compare the error mean squares and coefficients of multiple
determination. Does the model fitted to the validation data set yield similar estimates as the
model fitted to the model-building data set?

b. Calculate the mean squared prediction error in (9.20) and compare it to MSE obtained from
the model-building data set. Is there evidence of a substantial bias problem in MSE here?

c. Combine the model-building and validation data sets and fit the selected regression model 
to the combined data. Are the estimated regression coefficients and their estimated standard 
deviations appreciably different from those for the model-building data set? Should you 
expect any differences in the estimates? Explain. 


9.28. Refer to the CDI data set in Appendix C.2 and Project 9.26. The regression model identified
as best in Project 9.26c is to be validated by means of the validation data set consisting of the
odd-numbered CDIs.

a. Fit the regression model identified in Project 9.26 as best to the validation data set. Compare
the estimated regression coefficients and their estimated standard deviations with those
obtained in Project 9.26c. Also compare the error mean squares and coefficients of multiple
determination. Does the model fitted to the validation data set yield similar estimates as
the model fitted to the model-building data set?



b. Calculate the mean squared prediction error in (9.20) and compare it to MSE obtained from
the model-building data set. Is there evidence of a substantial bias problem in MSE here?

c. Fit the selected regression model to the combined model-building and validation data sets.
Are the estimated regression coefficients and their estimated standard deviations appreciably
different from those for the model fitted to the model-building data set? Should you expect
any differences in the estimates? Explain.


Case Studies


9.29. Refer to the Website developer data set in Appendix C.6. Management is interested in
determining what variables have the greatest impact on production output in the release of new
customer websites. Data on 13 three-person website development teams consisting of a project
manager, a designer, and a developer are provided in the data set. Production data from January
2001 through August 2002 include four potential predictors: (1) the change in the website
development process, (2) the size of the backlog of orders, (3) the team effect, and (4) the number
of months experience of each team. Develop a best subset model for predicting production
output. Justify your choice of model. Assess your model's ability to predict and discuss its use
as a tool for management decisions.

9.30. Refer to the Prostate cancer data set in Appendix C.5. Serum prostate-specific antigen (PSA)
was determined in 97 men with advanced prostate cancer. PSA is a well-established screening
test for prostate cancer and the oncologists wanted to examine the correlation between level
of PSA and a number of clinical measures for men who were about to undergo radical
prostatectomy. The measures are cancer volume, prostate weight, patient age, the amount of benign
prostatic hyperplasia, seminal vesicle invasion, capsular penetration, and Gleason score. Select
a random sample of 65 observations to use as the model-building data set. Develop a best subset
model for predicting PSA. Justify your choice of model. Assess your model's ability to predict
and discuss its usefulness to the oncologists.

9.31. Refer to Real estate sales data set in Appendix C.7. Residential sales that occurred during the
year 2002 were available from a city in the midwest. Data on 522 arms-length transactions
include sales price, style, finished square feet, number of bedrooms, pool, lot size, year built,
air conditioning, and whether or not the lot is adjacent to a highway. The city tax assessor was
interested in predicting sales price based on the demographic variable information given above.
Select a random sample of 300 observations to use in the model-building data set. Develop a
best subset model for predicting sales price. Justify your choice of model. Assess your model's
ability to predict and discuss its use as a tool for predicting sales price.

9.32. Refer to Prostate cancer Case Study 9.30. The regression model identified in Case Study 9.30
is to be validated by means of the validation data set consisting of those cases not selected for
the model-building data set.

a. Fit the regression model identified in Case Study 9.30 to the validation data set. Compare
the estimated regression coefficients and their estimated standard errors with those obtained
in Case Study 9.30. Also compare the error mean square and coefficients of multiple
determination. Does the model fitted to the validation data set yield similar estimates as the
model fitted to the model-building data set?

b. Calculate the mean squared prediction error (9.20) and compare it to MSE obtained from
the model-building data set. Is there evidence of a substantial bias problem in MSE here?

9.33. Refer to Real estate sales Case Study 9.31. The regression model identified in Case Study 9.31
is to be validated by means of the validation data set consisting of those cases not selected for
the model-building data set.



a. Fit the regression model identified in Case Study 9.31 to the validation data set. Compare
the estimated regression coefficients and their estimated standard errors with those obtained
in Case Study 9.31. Also compare the error mean square and coefficients of multiple
determination. Does the model fitted to the validation data set yield similar estimates as the
model fitted to the model-building data set?

b. Calculate the mean squared prediction error (9.20) and compare it to MSE obtained from 
the model-building data set. Is there evidence of a substantial bias problem in MSE here? 



Chapter 10

Building the Regression Model II: Diagnostics


In this chapter we take up a number of refined diagnostics for checking the adequacy of
a regression model. These include methods for detecting improper functional form for a
predictor variable, outliers, influential observations, and multicollinearity. We conclude the
chapter by illustrating the use of these diagnostic procedures in the surgical unit example.
In the following chapter, we take up some remedial measures that are useful when the
diagnostic procedures indicate model inadequacies.

10.1 Model Adequacy for a Predictor Variable: Added-Variable Plots


We discussed in Chapters 3 and 6 how a plot of residuals against a predictor variable in
the regression model can be used to check whether a curvature effect for that variable is
required in the model. We also described the plotting of residuals against predictor variables
not yet in the regression model to determine whether it would be helpful to add one or more
of these variables to the model.

A limitation of these residual plots is that they may not properly show the nature of the
marginal effect of a predictor variable, given the other predictor variables in the model.
Added-variable plots, also called partial regression plots and adjusted variable plots, are
refined residual plots that provide graphic information about the marginal importance of a
predictor variable given the other predictor variables already in the model. In addition,
these plots can at times be useful for identifying the nature of the marginal relation for a
predictor variable in the regression model.

Added-variable plots consider the marginal role of a predictor variable $X_k$, given that the
other predictor variables under consideration are already in the model. In an added-variable
plot, both the response variable $Y$ and the predictor variable $X_k$ under consideration are
regressed against the other predictor variables in the regression model and the residuals are
obtained for each. These residuals reflect the part of each variable that is not linearly
associated with the other predictor variables already in the regression model. The plot of these
residuals against each other (1) shows the marginal importance of this variable in reducing
the residual variability and (2) may provide information about the nature of the marginal
regression relation for the predictor variable $X_k$ under consideration for possible inclusion
in the regression model.

FIGURE 10.1 Prototype Added-Variable Plots. Panels (a), (b), and (c) each plot $e(Y|X_2)$
against $e(X_1|X_2)$.

To make these ideas more specific, we consider a first-order multiple regression model
with two predictor variables $X_1$ and $X_2$. The extension to more than two predictor variables is
direct. Suppose we are concerned about the nature of the regression effect for $X_1$, given that
$X_2$ is already in the model. We regress $Y$ on $X_2$ and obtain the fitted values and residuals:

$$\hat{Y}_i(X_2) = b_0 + b_2 X_{i2} \qquad (10.1a)$$

$$e_i(Y|X_2) = Y_i - \hat{Y}_i(X_2) \qquad (10.1b)$$

The notation here indicates explicitly the response and predictor variables in the fitted
model. We also regress $X_1$ on $X_2$ and obtain:

$$\hat{X}_{i1}(X_2) = b_0^* + b_2^* X_{i2} \qquad (10.2a)$$

$$e_i(X_1|X_2) = X_{i1} - \hat{X}_{i1}(X_2) \qquad (10.2b)$$

The added-variable plot for predictor variable $X_1$ consists of a plot of the $Y$ residuals $e(Y|X_2)$
against the $X_1$ residuals $e(X_1|X_2)$.

Figure 10.1 contains several prototype added-variable plots for our example, where $X_2$
is already in the regression model and $X_1$ is under consideration to be added. Figure 10.1a
shows a horizontal band, indicating that $X_1$ contains no additional information useful for
predicting $Y$ beyond that contained in $X_2$, so that it is not helpful to add $X_1$ to the regression
model here.

Figure 10.1b shows a linear band with a nonzero slope. This plot indicates that a linear
term in $X_1$ may be a helpful addition to the regression model already containing $X_2$. It
can be shown that the slope of the least squares line through the origin fitted to the plotted
residuals is $b_1$, the regression coefficient of $X_1$ if this variable were added to the regression
model already containing $X_2$.

Figure 10.1c shows a curvilinear band, indicating that the addition of $X_1$ to the regression
model may be helpful and suggesting the possible nature of the curvature effect by the pattern
shown.
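The construction in (10.1) and (10.2), and the statement that the line through the origin in the added-variable plot has slope $b_1$, can be checked numerically. The following is a minimal sketch on simulated data, not data from the text; the helper `ols_fit`, the sample size, and the coefficients used to generate the data are all assumptions made for illustration.

```python
import numpy as np

def ols_fit(X, y):
    """Return least squares coefficients and residuals; X must contain the intercept column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

# Simulated (hypothetical) data with mildly correlated predictors.
rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
Z2 = np.column_stack([ones, x2])

# Residuals e(Y | X2) and e(X1 | X2), as in (10.1b) and (10.2b).
_, e_y = ols_fit(Z2, y)
_, e_x1 = ols_fit(Z2, x1)

# Slope of the least squares line through the origin fitted to the plotted residuals.
slope_through_origin = (e_x1 @ e_y) / (e_x1 @ e_x1)

# Coefficient of X1 in the regression of Y on both X1 and X2.
full_coef, _ = ols_fit(np.column_stack([ones, x1, x2]), y)
b1 = full_coef[1]

print(slope_through_origin, b1)
```

The two printed values agree up to floating-point error. Plotting `e_y` against `e_x1` with any plotting library gives the added-variable plot itself.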

Added-variable plots, in addition to providing information about the possible nature of 
the marginal relationship for a predictor variable, given the other predictor variables already 
in the regression model, also provide information about the strength of this relationship. To 
see how this additional information is provided, consider Figure 10.2. Figure 10.2a illustrates
an added-variable plot for $X_1$ when $X_2$ is already in the model, based on $n = 3$ cases. The
vertical deviations of the plotted points around the horizontal line $e(Y|X_2) = 0$ shown in
Figure 10.2a represent the $Y$ residuals when $X_2$ alone is in the regression model. When
these deviations are squared and summed, we obtain the error sum of squares $SSE(X_2)$.
Figure 10.2b shows the same plotted points, but here the vertical deviations of these points
are around the least squares line through the origin with slope $b_1$. These deviations are the
residuals $e(Y|X_1, X_2)$ when both $X_1$ and $X_2$ are in the regression model. Hence, the sum
of the squares of these deviations is the error sum of squares $SSE(X_1, X_2)$.

FIGURE 10.2 Illustration of Deviations in an Added-Variable Plot.
(a) Deviations around Zero Line: $SSE(X_2) = \sum [e(Y_i|X_{i2})]^2$
(b) Deviations around Line with Slope $b_1$: $SSE(X_1, X_2) = \sum [e(Y_i|X_{i1}, X_{i2})]^2$
Horizontal axis in both panels: $e(X_1|X_2)$.

The difference between the two sums of squared deviations in Figures 10.2a and 10.2b
according to (7.1a) is the extra sum of squares $SSR(X_1|X_2)$. Hence, the difference in the
magnitudes of the two sets of deviations provides information about the marginal strength
of the linear relation of $X_1$ to the response variable, given that $X_2$ is in the model. If the
scatter of the points around the line through the origin with slope $b_1$ is much less than the
scatter around the horizontal line, inclusion of the variable $X_1$ in the regression model will
provide a substantial further reduction in the error sum of squares.
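The relation between the two sets of deviations and the extra sum of squares can likewise be verified numerically. The sketch below again uses simulated data (an assumption; nothing here comes from the text's examples) and a hypothetical helper `residuals`. It computes $SSE(X_2)$ from the deviations around the zero line, $SSE(X_1, X_2)$ from the deviations around the line through the origin, and their difference $SSR(X_1|X_2)$, together with the ratio $SSR(X_1|X_2)/SSE(X_2)$, the coefficient of partial determination.

```python
import numpy as np

def residuals(X, y):
    """Residuals from the least squares regression of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Simulated (hypothetical) data.
rng = np.random.default_rng(1)
n = 40
x2 = rng.normal(size=n)
x1 = 0.3 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

ones = np.ones(n)
Z2 = np.column_stack([ones, x2])
Z12 = np.column_stack([ones, x1, x2])

e_y = residuals(Z2, y)    # e(Y | X2)
e_x1 = residuals(Z2, x1)  # e(X1 | X2)

# Deviations around the horizontal zero line give SSE(X2).
sse_x2 = e_y @ e_y

# Deviations around the line through the origin with slope b1 give SSE(X1, X2).
b1 = (e_x1 @ e_y) / (e_x1 @ e_x1)
dev = e_y - b1 * e_x1
sse_from_plot = dev @ dev

# The same quantity obtained directly from the full regression.
e_full = residuals(Z12, y)
sse_x1_x2 = e_full @ e_full

# Extra sum of squares and coefficient of partial determination.
ssr_x1_given_x2 = sse_x2 - sse_x1_x2
partial_r2 = ssr_x1_given_x2 / sse_x2

print(sse_x2, sse_x1_x2, ssr_x1_given_x2, partial_r2)
```

A value of `partial_r2` near 1 corresponds to points lying much closer to the sloped line than to the horizontal line in the added-variable plot.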

Added-variable plots are also useful for uncovering outlying data points that may have a 
strong influence in estimating the relationship of the predictor variable to the response 
variable, given the other predictor variables already in the model. 

Table 10.1 shows a portion of the data on average annual income of managers during the past
two years ($X_1$), a score measuring each manager's risk aversion ($X_2$), and the amount of life
insurance carried ($Y$) for a sample of 18 managers in the 30-39 age group. Risk aversion
was measured by a standard questionnaire administered to each manager: the higher the
score, the greater the degree of risk aversion. Income and risk aversion are mildly correlated
here, the coefficient of correlation being $r_{12} = .254$.

TABLE 10.1

            Average Annual                          Amount of Life
            Income               Risk Aversion      Insurance Carried
Manager     (thousand dollars)   Score              (thousand dollars)
   i        X_i1                 X_i2               Y_i

   1        45.010               6                  91
   2        57.204               4                  162
   3        26.852
  ...       ...
  16        46.130
  17        30.366
  18        39.060

A fit of the first-order regression model yields:

$$\hat{Y} = -205.72 + 6.2880 X_1 + 4.738 X_2 \qquad (10.3)$$

The residuals for this fitted model are plotted against $X_1$ in Figure 10.3a. This residual
plot clearly suggests that a linear relation for $X_1$ is not appropriate in the model already
containing $X_2$. To obtain more information about the nature of this relationship, we shall
use an added-variable plot. We regress $Y$ and $X_1$ each against $X_2$. When doing this, we
obtain:

$$\hat{Y}(X_2) = 50.70 + 15.54 X_2 \qquad (10.4a)$$

$$\hat{X}_1(X_2) = 40.779 + 1.718 X_2 \qquad (10.4b)$$

FIGURE 10.3 Residual Plot and Added-Variable Plot: Life Insurance Example.
(a) Residual Plot against $X_1$
(b) Added-Variable Plot for $X_1$ (horizontal axis: $e(X_1|X_2)$)

The residuals from these two fitted models are plotted against each other in the added-variable
plot in Figure 10.3b. This plot also contains the least squares line through the
origin, which has slope $b_1 = 6.2880$. The added-variable plot suggests that the curvilinear
relation between $Y$ and $X_1$ when $X_2$ is already in the regression model is strongly positive,
and that a slight concave upward shape may be present. The suggested concavity of the
relationship is also evident from the vertical deviations around the line through the origin
with slope $b_1$. These deviations are positive at the left, negative in the middle, and positive
again at the right. Overall, the deviations from linearity appear to be modest in the range of
the predictor variables.





88 Part Two Multiple Linear Ke^rcssum 


Note also that the scatter of the points iirountl the least squares line through the origin 
slope/?[ = 6,2880 is much smaller than is the scatter around the horizontal line £-(KIXt) — q 
indicating that adding X\ to the regression model with a linear relation will substantially 
reduce the error sum of squares. In fact, the coefficient t)f partial determination foi- the linear 
effect of X\ is /?^ l)2 = .984, Incorporating a curvilinear effect for X t will lead to only a 
modest further reduction in the error sum ot' squares since the plotted points are already 
quite close to the linear rekition through the origin with slope by. 

Finally, the added-vmiable plot in Figure 10.3b shows one outlying case, in the upp er 
right earner. The influence of this case needs to be investigated by procedures to be explained 
later in this chapter. 


Example 2 


For the body fat example in Table 7.1 (page 257), we consider here the regression of body fat
($Y$) only on triceps skinfold thickness ($X_1$) and thigh circumference ($X_2$). We omit the third
predictor variable ($X_3$, midarm circumference) to focus the discussion of added-variable
plots on its essentials. Recall that $X_1$ and $X_2$ are highly correlated ($r_{12} = .92$). The fitted
regression function was obtained in Table 7.2c (page 258):

    $\hat{Y} = -19.174 + .2224X_1 + .6594X_2$


Figures 10.4a and 10.4c contain plots of the residuals against $X_1$ and $X_2$, respectively.
These plots do not indicate any lack of fit for the linear terms in the regression model or the
existence of unequal variances of the error terms.

Figures 10.4b and 10.4d contain the added-variable plots for $X_1$ and $X_2$, respectively,
when the other predictor variable is already in the regression model. Both plots also show
the line through the origin with slope equal to the regression coefficient for the predictor
variable if it were added to the fitted model. These two plots provide some useful additional
information. The scatter in Figure 10.4b follows the prototype in Figure 10.1a, suggesting
that $X_1$ is of little additional help in the model when $X_2$ is already present. This information
is not provided by the regular residual plot in Figure 10.4a. The fact that $X_1$ appears to be
of little marginal help when $X_2$ is already in the regression model is in accord with earlier
findings in Chapter 7. We saw there that the coefficient of partial determination is only
$R^2_{Y1|2} = .031$ and that the $t^*$ statistic for $b_1$ is only .73.

The added-variable plot for $X_2$ in Figure 10.4d follows the prototype in Figure 10.1b,
showing a linear scatter with positive slope. We also see in Figure 10.4d that there is
somewhat less variability around the line with slope $b_2$ than around the horizontal line
$e(Y \mid X_1) = 0$. This suggests that: (1) variable $X_2$ may be helpful in the regression model
even when $X_1$ is already in the model, and (2) a linear term in $X_2$ appears to be adequate
because no curvilinear relation is suggested by the scatter of points. Thus, the
added-variable plot for $X_2$ in Figure 10.4d complements the regular residual plot in Figure 10.4c
by indicating the potential usefulness of thigh circumference ($X_2$) in the regression model
when triceps skinfold thickness ($X_1$) is already in the model. This information is consistent
with the $t^*$ statistic for $b_2$ of 2.26 in Table 7.2c and the moderate coefficient of partial
determination of $R^2_{Y2|1} = .232$. Finally, the added-variable plot in Figure 10.4d reveals the
presence of one potentially influential case (case 3) in the lower left corner. The influence
of this case will be investigated in greater detail in Section 10.4.



FIGURE 10.4 Residual Plots and Added-Variable Plots—Body Fat Example with Two Predictor
Variables.

(a) Residual Plot against $X_1$
(b) Added-Variable Plot for $X_1$
(c) Residual Plot against $X_2$
(d) Added-Variable Plot for $X_2$


Comments 

1. An added-variable plot only suggests the nature of the functional relation in which a predictor
variable should be added to the regression model but does not provide an analytic expression of the
relation. Furthermore, the relation shown is for $X_k$ adjusted for the other predictor variables in the
regression model, not for $X_k$ directly. Hence, a variety of transformations or curvature effect terms
may need to be investigated and additional residual plots utilized to identify the best transformation
or curvature effect terms.

2. Added-variable plots need to be used with caution for identifying the nature of the marginal
effect of a predictor variable. These plots may not show the proper form of the marginal effect of a
predictor variable if the functional relations for some or all of the predictor variables already in the
regression model are misspecified. For example, if $X_2$ and $X_3$ are related in a curvilinear fashion to
the response variable but the regression model uses linear terms only, the added-variable plots for $X_2$
and $X_3$ may not show the proper relationships to the response variable, especially when the predictor
variables are correlated. Since added-variable plots for the several predictor variables are all concerned
with marginal effects only, they may therefore not be effective when the relations of the predictor
variables to the response variable are complex. Also, added-variable plots may not detect interaction
effects that are present. Finally, high multicollinearity among the predictor variables may cause the
added-variable plots to show an improper functional relation for the marginal effect of a predictor
variable.

3. When several added-variable plots are required for a set of predictor variables, it is not
necessary to fit entirely new regression models each time. Computational procedures are available that
economize on the calculations required; these are explained in specialized texts such as Reference 10.1.

4. Any fitted multiple regression function can be obtained from a sequence of fitted partial
regressions. To illustrate this, consider again the life insurance example, where the fitted regression
of $Y$ on $X_2$ is given in (10.4a) and the fitted regression of $X_1$ on $X_2$ is given in (10.4b). If we now
regress the residuals $e(Y \mid X_2) = Y - \hat{Y}(X_2)$ on the residuals $e(X_1 \mid X_2) = X_1 - \hat{X}_1(X_2)$,
using regression through the origin, we obtain (calculations not shown):

    $\hat{e}(Y \mid X_2) = 6.2880[e(X_1 \mid X_2)]$

By simple substitution, using (10.4a) and (10.4b), we obtain:

    $[\hat{Y} - (50.70 + 15.54X_2)] = 6.2880[X_1 - (40.779 + 1.718X_2)]$    (10.5)

or:

    $\hat{Y} = -205.72 + 6.2880X_1 + 4.737X_2$    (10.6)

where the solution for $\hat{Y}$ is the fitted value when $X_1$ and $X_2$ are included in the regression model.
Note that the fitted regression function in (10.6) is the same as when the regression model was fitted
to $X_1$ and $X_2$ directly in (10.3), except for a minor difference due to rounding effects.
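The result in Comment 4 can be checked numerically. The following Python sketch uses a small invented data set (not the life insurance data) and helper names of our own choosing (`simple_fit`, `residuals`, `two_predictor_fit`); it regresses the two sets of residuals on each other through the origin and compares the slope with the coefficient from the direct two-predictor fit. The last lines repeat the substitution arithmetic that leads from (10.5) to (10.6).

```python
# Sketch: a multiple regression coefficient recovered from two partial regressions.
# The data are made up for illustration; they are NOT the life insurance data.

def mean(v):
    return sum(v) / len(v)

def simple_fit(x, y):
    """Least squares intercept and slope for y regressed on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    b0, b1 = simple_fit(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def two_predictor_fit(x1, x2, y):
    """Least squares b0, b1, b2 for y on x1 and x2 (centered normal equations)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

x1 = [2.0, 4.0, 5.0, 7.0, 8.0, 11.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
y = [5.0, 9.0, 13.0, 15.0, 22.0, 25.0]

e_y = residuals(x2, y)     # e(Y | X2)
e_x1 = residuals(x2, x1)   # e(X1 | X2)

# Regression through the origin of e(Y | X2) on e(X1 | X2).
slope_origin = sum(a * b for a, b in zip(e_x1, e_y)) / sum(a * a for a in e_x1)

b0, b1, b2 = two_predictor_fit(x1, x2, y)
print(slope_origin, b1)    # the two values agree up to floating-point error

# Arithmetic behind the step from (10.5) to (10.6), using the rounded coefficients.
intercept_10_6 = 50.70 - 6.2880 * 40.779
coef_x2_10_6 = 15.54 - 6.2880 * 1.718
print(round(intercept_10_6, 2), round(coef_x2_10_6, 3))
```

The same comparison could be run on any data set in which the two predictors are not perfectly collinear.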

5. A residual plot closely related to the added-variable plot is the partial residual plot. This plot
also is used as an aid for identifying the nature of the relationship for a predictor variable under
consideration for addition to the regression model. The partial residual plot takes as the starting point
the usual residuals $e_i = Y_i - \hat{Y}_i$ when the model including $X_k$ is fitted, to which the regression
effect for $X_k$ is added. Specifically, the partial residuals for examining the effect of predictor
variable $X_k$, denoted by $p_i(X_k)$, are defined as follows:

    $p_i(X_k) = e_i + b_k X_{ik}$    (10.7)

Thus, for a partial residual, we add the effect of $X_k$, as reflected by the fitted model term $b_k X_{ik}$,
back onto the residual. A plot of these partial residuals against $X_k$ is referred to as a partial residual
plot. The reader is referred to References 10.2 and 10.3 for more details on partial residual plots.
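To make definition (10.7) concrete, the sketch below computes partial residuals for $X_1$ in a two-predictor model. The data are invented and the helper `two_predictor_fit` is our own, so this is an illustration under those assumptions rather than a reproduction of any example in the text. It also checks one property that follows from the least squares normal equations: the simple regression of the partial residuals on $X_1$ has slope $b_1$.

```python
# Sketch: partial residuals p_i(X_1) = e_i + b_1 * X_i1 for a two-predictor model.
# The data are made up for illustration only.

def mean(v):
    return sum(v) / len(v)

def two_predictor_fit(x1, x2, y):
    """Least squares b0, b1, b2 for y on x1 and x2 (centered normal equations)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

x1 = [2.0, 4.0, 5.0, 7.0, 8.0, 11.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
y = [5.0, 9.0, 13.0, 15.0, 22.0, 25.0]

b0, b1, b2 = two_predictor_fit(x1, x2, y)

# Ordinary residuals from the model that includes both predictors.
resid = [yi - (b0 + b1 * a + b2 * b) for yi, a, b in zip(y, x1, x2)]

# Partial residuals for X_1, as in (10.7) with k = 1.
partial_x1 = [e + b1 * a for e, a in zip(resid, x1)]

# Slope of the simple regression of the partial residuals on X_1.
m1, mp = mean(x1), mean(partial_x1)
slope = (sum((a - m1) * (q - mp) for a, q in zip(x1, partial_x1))
         / sum((a - m1) ** 2 for a in x1))
print(slope, b1)   # equal up to floating-point error
```

Plotting `partial_x1` against `x1` would give the partial residual plot; curvature around the line with slope `b1` is what the plot is meant to reveal.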


10.2 Identifying Outlying Y Observations — Studentized
Deleted Residuals

Outlying Cases 

Frequently in regression analysis applications, the data set contains some cases that are
outlying or extreme; that is, the observations for these cases are well separated from the
remainder of the data. These outlying cases may involve large residuals and often have
dramatic effects on the fitted least squares regression function. It is therefore important to
study the outlying cases carefully and decide whether they should be retained or eliminated,
and if retained, whether their influence should be reduced in the fitting process and/or the
regression model should be revised.

FIGURE 10.5 Scatter Plot for Regression with One Predictor Variable Illustrating Outlying
Cases.

(Scatter plot of $Y$ against $X$, with the outlying cases labeled 1 through 4.)

A case may be outlying or extreme with respect to its Y value, its X value(s), or both. 

Figure 10.5 illustrates this for the case of regression with a single predictor variable. In the
scatter plot in Figure 10.5, case 1 is outlying with respect to its Y value, given X. Note that 
this point falls far outside the scatter, although its X value is near the middle of the range of 
observations on the predictor variable. Cases 2, 3, and 4 are outlying with respect to their 
X values since they have much larger X values than those for the other cases; cases 3 and 
4 are also outlying with respect to their Y values, given X. 

Not all outlying cases have a strong influence on the fitted regression function. Case 1 
in Figure 10.5 may not be too influential because a number of other cases have similar 
X values that will keep the fitted regression function from being displaced too far by the 
outlying case. Likewise, case 2 may not be too influential because its Y value is consistent 
with the regression relation displayed by the nonextreme cases. Cases 3 and 4, on the other
hand, are likely to be very influential in affecting the fit of the regression function. They 
are outlying with regard to their X values, and their Y values are not consistent with the 
regression relation for the other cases. 

A basic step in any regression analysis is to determine if the regression model under 
consideration is heavily influenced by one or a few cases in the data set. For regression 
with one or two predictor variables, it is relatively simple to identify outlying cases with 
respect to their X or Y values by means of box plots, stem-and-leaf plots, scatter plots, and
residual plots, and to study whether they are influential in affecting the fitted regression 
function. When more than two predictor variables are included in the regression model, 
however, the identification of outlying cases by simple graphic means becomes difficult 
because single-variable or two-variable examinations do not necessarily help find outliers 
relative to a multivariable regression model. Some univariate outliers may not be extreme 
in a multiple regression model, and, conversely, some multivariable outliers may not be 
detectable in single-variable or two-variable analyses.

We now discuss the use of some refined measures for identifying cases with outlying
Y observations. In the following section we take up the identification of cases that are
multivariable outliers with respect to their X values. 





Residuals and Semistudentized Residuals 

The detection of outlying or extreme $Y$ observations based on an examination of the residuals
has been considered in earlier chapters. We utilized there either the residual $e_i$:

    $e_i = Y_i - \hat{Y}_i$    (10.8)

or the semistudentized residuals $e_i^*$:

    $e_i^* = \dfrac{e_i}{\sqrt{MSE}}$    (10.9)

We introduce now two refinements to make the analysis of residuals more effective for
identifying outlying $Y$ observations. These refinements require the use of the hat matrix,
which we encountered in Chapters 5 and 6.

Hat Matrix 

The hat matrix was defined in (6.30a):

    $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$    (10.10)

We noted in (6.30) that the fitted values $\hat{Y}_i$ can be expressed as linear combinations of the
observations $Y_i$ through the hat matrix:

    $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$    (10.11)

and similarly we noted in (6.31) that the residuals $e_i$ can also be expressed as linear
combinations of the observations $Y_i$ by means of the hat matrix:

    $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$    (10.12)

Further, we noted in (6.32) that the variance-covariance matrix of the residuals involves
the hat matrix:

    $\boldsymbol{\sigma}^2\{\mathbf{e}\} = \sigma^2(\mathbf{I} - \mathbf{H})$    (10.13)

Therefore, the variance of residual $e_i$, denoted by $\sigma^2\{e_i\}$, is:

    $\sigma^2\{e_i\} = \sigma^2(1 - h_{ii})$    (10.14)

where $h_{ii}$ is the $i$th element on the main diagonal of the hat matrix, and the covariance
between residuals $e_i$ and $e_j$ ($i \neq j$) is:

    $\sigma\{e_i, e_j\} = \sigma^2(0 - h_{ij}) = -h_{ij}\sigma^2 \qquad i \neq j$    (10.15)

where $h_{ij}$ is the element in the $i$th row and $j$th column of the hat matrix.

These variances and covariances are estimated by using $MSE$ as the estimator of the error
variance $\sigma^2$:

    $s^2\{e_i\} = MSE(1 - h_{ii})$    (10.16a)

    $s\{e_i, e_j\} = -h_{ij}(MSE) \qquad i \neq j$    (10.16b)

We shall illustrate these different roles of the hat matrix by an example. 



TABLE 10.2 Illustration of Hat Matrix.

(a) Data and Basic Results

        (1)     (2)     (3)     (4)       (5)       (6)      (7)
 i      X_i1    X_i2    Y_i     Ŷ_i       e_i       h_ii     s²{e_i}
 1      14      25      301     282.2     18.8      .3877    352.0
 2      19      32      327     332.3     -5.3      .9513    28.0
 3      12      22      246     260.0     -14.0     .6614    194.6
 4      11      15      187     186.5     .5        .9996    .2

(b) H

     .3877     .1727     .4553    -.0157
     .1727     .9513    -.1284     .0044
     .4553    -.1284     .6614     .0117
    -.0157     .0044     .0117     .9996

(c) s²{e}

     352.0     -99.3    -261.8       9.0
     -99.3      28.0      73.8      -2.5
    -261.8      73.8     194.6      -6.7
       9.0      -2.5      -6.7        .2


Example 


A small data set based on $n = 4$ cases for examining the regression relation between
a response variable $Y$ and two predictor variables $X_1$ and $X_2$ is shown in Table 10.2a,
columns 1-3. The fitted first-order model and the error mean square are:

    $\hat{Y} = 80.93 - 5.84X_1 + 11.32X_2$
    $MSE = 574.9$    (10.17)

The fitted values and the residuals for the four cases are shown in columns 4 and 5 of
Table 10.2a.

The hat matrix for these data is shown in Table 10.2b. It was obtained by means of (10.10)
for the $\mathbf{X}$ matrix:

    $\mathbf{X} = \begin{bmatrix} 1 & 14 & 25 \\ 1 & 19 & 32 \\ 1 & 12 & 22 \\ 1 & 11 & 15 \end{bmatrix}$

Note from (10.10) that the hat matrix is solely a function of the predictor variable(s). Also
note from Table 10.2b that the hat matrix is symmetric. The diagonal elements $h_{ii}$ of the
hat matrix are repeated in column 6 of Table 10.2a.

We illustrate that the fitted values are linear combinations of the $Y$ values by calculating
$\hat{Y}_1$ by means of (10.11):

    $\hat{Y}_1 = h_{11}Y_1 + h_{12}Y_2 + h_{13}Y_3 + h_{14}Y_4$
    $\quad\;\; = .3877(301) + .1727(327) + .4553(246) - .0157(187)$
    $\quad\;\; = 282.2$

This is the same result, except for possible rounding effects, as obtained from the fitted
regression function (10.17):

    $\hat{Y}_1 = 80.93 - 5.84(14) + 11.32(25) = 282.2$




The estimated variance-covariance matrix of the residuals, $\mathbf{s}^2\{\mathbf{e}\} = MSE(\mathbf{I} - \mathbf{H})$, is shown
in Table 10.2c. It was obtained by using $MSE = 574.9$. The estimated variances of the
residuals are shown in the main diagonal of the variance-covariance matrix in Table 10.2c
and are repeated in column 7 of Table 10.2a. We illustrate their direct calculation for case 1
by using (10.16a):

    $s^2\{e_1\} = 574.9(1 - .3877) = 352.0$

We see from Table 10.2a, column 7, that the residuals do not have constant variance.
In fact, the variances differ greatly here because the data set is so small. As we shall note
in Section 10.3, residuals for cases that are outlying with respect to the $X$ variables have
smaller variances.

Note also that the covariances in the matrix in Table 10.2c are not zero; hence, pairs of 
residuals are correlated, some positively and some negatively. We noted this correlation in 
Chapter 3, but also pointed out there that the correlations become very small for larger data 
sets. 
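The calculations in this example can be reproduced with a short script. The sketch below takes the four cases of Table 10.2 (the $\mathbf{X}$ matrix and the $Y$ values 301, 327, 246, 187 used in the calculation of the first fitted value) and evaluates (10.10), (10.11), (10.12), and $MSE(\mathbf{I} - \mathbf{H})$ directly. The small matrix helpers (`transpose`, `matmul`, `inverse`) are our own and are adequate only for tiny, well-conditioned problems such as this one.

```python
# Sketch: hat matrix, fitted values, residuals and s^2{e} for the Table 10.2 data.

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (for small matrices only)."""
    n = len(a)
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot_row] = m[pivot_row], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(n):
            if r != c:
                factor = m[r][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

X = [[1, 14, 25], [1, 19, 32], [1, 12, 22], [1, 11, 15]]
Y = [301, 327, 246, 187]
n, p = len(X), len(X[0])

Xt = transpose(X)
H = matmul(matmul(X, inverse(matmul(Xt, X))), Xt)            # (10.10)
fitted = [sum(h * y for h, y in zip(row, Y)) for row in H]   # (10.11)
resid = [y - f for y, f in zip(Y, fitted)]                   # (10.12)
mse = sum(e * e for e in resid) / (n - p)

# Estimated variance-covariance matrix of the residuals, MSE * (I - H).
s2_e = [[mse * ((1.0 if i == j else 0.0) - H[i][j]) for j in range(n)]
        for i in range(n)]

print([round(H[i][i], 4) for i in range(n)])   # leverages, column 6 of Table 10.2a
print([round(f, 1) for f in fitted])           # fitted values, column 4
print(round(mse, 1), round(s2_e[0][0], 1), round(s2_e[0][1], 1))
```

Because $n - p = 1$ here, the residual vector has a single degree of freedom, which is one reason the residual variances differ so sharply across the four cases.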


Comment 

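The diagonal element $h_{ii}$ of the hat matrix can be obtained directly from:

    $h_{ii} = \mathbf{X}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_i$    (10.18)

where:

    $\underset{p \times 1}{\mathbf{X}_i} = \begin{bmatrix} 1 \\ X_{i,1} \\ \vdots \\ X_{i,p-1} \end{bmatrix}$    (10.18a)

Note that $\mathbf{X}_i$ corresponds to the $\mathbf{X}_h$ vector in (6.53) except that $\mathbf{X}_i$ pertains to the $i$th case, and
that $\mathbf{X}_i'$ is simply the $i$th row of the $\mathbf{X}$ matrix, pertaining to the $i$th case.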


Studentized Residuals 

The first refinement in making residuals more effective for detecting outlying $Y$
observations involves recognition of the fact that the residuals $e_i$ may have substantially different
variances $\sigma^2\{e_i\}$. It is therefore appropriate to consider the magnitude of each $e_i$ relative to
its estimated standard deviation to give recognition to differences in the sampling errors of
the residuals. We see from (10.16a) that an estimator of the standard deviation of $e_i$ is:

    $s\{e_i\} = \sqrt{MSE(1 - h_{ii})}$    (10.19)

The ratio of $e_i$ to $s\{e_i\}$ is called the studentized residual and will be denoted by $r_i$:

    $r_i = \dfrac{e_i}{s\{e_i\}}$    (10.20)

While the residuals $e_i$ will have substantially different sampling variations if their standard
deviations differ markedly, the studentized residuals $r_i$ have constant variance (when the
model is appropriate). Studentized residuals often are called internally studentized residuals.





Deleted Residuals

The second refinement to make residuals more effective for detecting outlying $Y$
observations is to measure the $i$th residual $e_i = Y_i - \hat{Y}_i$ when the fitted regression is based on all
of the cases except the $i$th one. The reason for this refinement is that if $Y_i$ is far outlying,
the fitted least squares regression function based on all cases including the $i$th one may be
influenced to come close to $Y_i$, yielding a fitted value $\hat{Y}_i$ near $Y_i$. In that event, the residual
$e_i$ will be small and will not disclose that $Y_i$ is outlying. On the other hand, if the $i$th case
is excluded before the regression function is fitted, the least squares fitted value $\hat{Y}_i$ is not
influenced by the outlying $Y_i$ observation, and the residual for the $i$th case will then tend to
be larger and therefore more likely to disclose the outlying $Y$ observation.

The procedure then is to delete the $i$th case, fit the regression function to the remaining
$n - 1$ cases, and obtain the point estimate of the expected value when the $X$ levels are those
of the $i$th case, to be denoted by $\hat{Y}_{i(i)}$. The difference between the actual observed value $Y_i$
and the estimated expected value $\hat{Y}_{i(i)}$ will be denoted by $d_i$:

    $d_i = Y_i - \hat{Y}_{i(i)}$    (10.21)

The difference $d_i$ is called the deleted residual for the $i$th case. We encountered this same
difference in (9.16), where it was called the PRESS prediction error for the $i$th case.

An algebraically equivalent expression for $d_i$ that does not require a recomputation of
the fitted regression function omitting the $i$th case is:

    $d_i = \dfrac{e_i}{1 - h_{ii}}$    (10.21a)

where $e_i$ is the ordinary residual for the $i$th case and $h_{ii}$ is the $i$th diagonal element in the
hat matrix, as given in (10.18). Note that the larger is the value $h_{ii}$, the larger will be the
deleted residual as compared to the ordinary residual.

Thus, deleted residuals will at times identify outlying $Y$ observations when ordinary
residuals would not identify these; at other times deleted residuals lead to the same
identifications as ordinary residuals.
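The equivalence between refitting without case $i$ and dividing the ordinary residual by $1 - h_{ii}$ can be checked by brute force. The sketch below does so for a simple linear regression on invented data in which the last case has an extreme $X$ value. For a single predictor it uses the familiar closed form $h_{ii} = 1/n + (X_i - \bar{X})^2 / \sum (X_j - \bar{X})^2$ for the leverage, which is our substitution for the general matrix expression; `simple_fit` is a helper name of our own.

```python
# Sketch: deleted residuals by refitting versus the shortcut e_i / (1 - h_ii).
# Simple linear regression on made-up data; the last case has an extreme X value.

def simple_fit(x, y):
    """Least squares intercept and slope for y regressed on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]
n = len(x)

b0, b1 = simple_fit(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

shortcut, brute = [], []
for i in range(n):
    e_i = y[i] - (b0 + b1 * x[i])              # ordinary residual, all n cases
    h_ii = 1.0 / n + (x[i] - xbar) ** 2 / sxx  # leverage for one predictor
    shortcut.append(e_i / (1.0 - h_ii))        # shortcut form of the deleted residual

    # Refit without case i and predict at X_i: the definition of the deleted residual.
    c0, c1 = simple_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    brute.append(y[i] - (c0 + c1 * x[i]))

for i in range(n):
    print(i + 1, round(shortcut[i], 4), round(brute[i], 4))
```

The two columns printed agree for every case, and the gap between the ordinary and deleted residual is largest for the high-leverage last case.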

Note that a deleted residual also corresponds to the prediction error for a new observation
in the numerator of (2.35). There, we are predicting a new $n + 1$ observation from the fitted
regression function based on the earlier $n$ cases. Modifying the earlier notation for the
context of deleted residuals, where $n - 1$ cases are used for predicting the “new” $n$th case,
we can restate the result in (6.63a) to obtain the estimated variance of $d_i$:

    $s^2\{d_i\} = MSE_{(i)}\left(1 + \mathbf{X}_i'(\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\mathbf{X}_i\right)$    (10.22)

where $\mathbf{X}_i$ is the $X$ observations vector (10.18a) for the $i$th case, $MSE_{(i)}$ is the mean square
error when the $i$th case is omitted in fitting the regression function, and $\mathbf{X}_{(i)}$ is the $\mathbf{X}$ matrix
with the $i$th case deleted. An algebraically equivalent expression for $s^2\{d_i\}$ is:

    $s^2\{d_i\} = \dfrac{MSE_{(i)}}{1 - h_{ii}}$    (10.22a)

It follows from (6.63) that:

    $\dfrac{d_i}{s\{d_i\}} \sim t(n - p - 1)$    (10.23)

Remember that $n - 1$ cases are used here in predicting the $i$th observation; hence the
degrees of freedom are $(n - 1) - p = n - p - 1$.


Studentized Deleted Residuals 


Combining the above two refinements, we utilize for diagnosis of outlying or extreme
$Y$ observations the deleted residual $d_i$ in (10.21) and studentize it by dividing it by its
estimated standard deviation given by (10.22). The studentized deleted residual, denoted
by $t_i$, therefore is:

    $t_i = \dfrac{d_i}{s\{d_i\}}$    (10.24)

It follows from (10.21a) and (10.22a) that an algebraically equivalent expression for $t_i$ is:

    $t_i = \dfrac{e_i}{\sqrt{MSE_{(i)}(1 - h_{ii})}}$    (10.24a)

The studentized deleted residual $t_i$ in (10.24) is also called an externally studentized
residual, in contrast to the internally studentized residual $r_i$ in (10.20). We know from (10.23)
that each studentized deleted residual $t_i$ follows the $t$ distribution with $n - p - 1$ degrees
of freedom. The $t_i$, however, are not independent.

Fortunately, the studentized deleted residuals $t_i$ in (10.24) can be calculated without
having to fit new regression functions each time a different case is omitted. A simple
relationship exists between $MSE$ and $MSE_{(i)}$:

    $(n - p)MSE = (n - p - 1)MSE_{(i)} + \dfrac{e_i^2}{1 - h_{ii}}$    (10.25)

Using this relationship in (10.24a) yields the following equivalent expression for $t_i$:

    $t_i = e_i\left[\dfrac{n - p - 1}{SSE(1 - h_{ii}) - e_i^2}\right]^{1/2}$    (10.26)


Thus, the studentized deleted residuals $t_i$ can be calculated from the residuals $e_i$, the error
sum of squares $SSE$, and the hat matrix values $h_{ii}$, all for the fitted regression based on the
$n$ cases.
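As a check that no refitting is needed, the following sketch computes each studentized deleted residual twice for a simple linear regression on invented data: once from the full-data quantities $e_i$, $SSE$, and $h_{ii}$, as $t_i = e_i[(n - p - 1)/(SSE(1 - h_{ii}) - e_i^2)]^{1/2}$, and once by actually deleting the case, refitting, and using $t_i = e_i/\sqrt{MSE_{(i)}(1 - h_{ii})}$. The single-predictor leverage formula and the helper `simple_fit` are our own assumptions for the illustration.

```python
import math

# Sketch: studentized deleted residuals without refitting versus brute-force deletion.
# Simple linear regression (p = 2 parameters) on made-up data.

def simple_fit(x, y):
    """Least squares intercept and slope for y regressed on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]
n, p = len(x), 2

b0, b1 = simple_fit(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]   # leverages, one predictor
sse = sum(ei * ei for ei in e)

t_shortcut, t_brute = [], []
for i in range(n):
    # Shortcut: only full-data quantities are needed.
    t_shortcut.append(e[i] * math.sqrt((n - p - 1) / (sse * (1 - h[i]) - e[i] ** 2)))

    # Brute force: delete case i, refit, and compute MSE_(i) on n - 1 cases.
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    c0, c1 = simple_fit(x_del, y_del)
    sse_del = sum((yj - (c0 + c1 * xj)) ** 2 for xj, yj in zip(x_del, y_del))
    mse_del = sse_del / (n - 1 - p)
    t_brute.append(e[i] / math.sqrt(mse_del * (1 - h[i])))

for i in range(n):
    print(i + 1, round(t_shortcut[i], 4), round(t_brute[i], 4))
```

With larger data sets the shortcut matters more, since the brute-force route requires one additional regression fit per case.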

Test for Outliers. We identify as outlying $Y$ observations those cases whose studentized
deleted residuals are large in absolute value. In addition, we can conduct a formal test
by means of the Bonferroni test procedure of whether the case with the largest absolute
studentized deleted residual is an outlier. Since we do not know in advance which case will
have the largest absolute value $|t_i|$, we consider the family of tests to include $n$ tests, one
for each case. If the regression model is appropriate, so that no case is outlying because of
a change in the model, then each studentized deleted residual will follow the $t$ distribution
with $n - p - 1$ degrees of freedom. The appropriate Bonferroni critical value therefore is
$t(1 - \alpha/2n;\, n - p - 1)$. Note that the test is two-sided since we are not concerned with the
direction of the residuals but only with their absolute values.

Example

For the body fat example with two predictor variables ($X_1$, $X_2$), we wish to examine
whether there are outlying $Y$ observations. Table 10.3 presents the residuals $e_i$ in column 1,
the diagonal elements $h_{ii}$ of the hat matrix in column 2, and the studentized deleted residuals
$t_i$ in column 3. We illustrate the calculation of the studentized deleted residual for the first
case. The $X$ values for this case, given in Table 7.1, are $X_{11} = 19.5$ and $X_{12} = 43.1$. Using
the fitted regression function from Table 7.2c, we obtain:

    $\hat{Y}_1 = -19.174 + .2224(19.5) + .6594(43.1) = 13.583$

Since $Y_1 = 11.9$, the residual for this case is $e_1 = 11.9 - 13.583 = -1.683$. We also know
from Table 7.2c that $SSE = 109.95$ and from Table 10.3 that $h_{11} = .201$. Hence, by (10.26),
we find:

    $t_1 = -1.683\left[\dfrac{20 - 3 - 1}{109.95(1 - .201) - (-1.683)^2}\right]^{1/2} = -.730$

Note from Table 10.3, column 3, that cases 3, 8, and 13 have the largest absolute
studentized deleted residuals. Incidentally, consideration of the residuals $e_i$ (shown in Table 10.3,
column 1) here would have identified cases 2, 8, and 13 as the most outlying ones, but not
case 3.

TABLE 10.3 Residuals, Diagonal Elements of the Hat Matrix, and Studentized Deleted
Residuals—Body Fat Example with Two Predictor Variables.

          (1)        (2)        (3)
  i       e_i        h_ii       t_i
  1     -1.683      .201       -.730
  2      3.643      .059       1.534
  3     -3.176      .372      -1.656
  4     -3.158      .111      -1.348
  5       .000      .248        .000
  6      -.361      .129       -.148
  7       .716      .156        .298
  8      4.015      .096       1.760
  9      2.655      .115       1.117
 10     -2.475      .110      -1.034
 11       .336      .120        .137
 12      2.226      .109        .923
 13     -3.947      .178      -1.825
 14      3.447      .148       1.524
 15       .571      .333        .267
 16       .642      .095        .258
 17      -.851      .106       -.344
 18      -.783      .197       -.335
 19     -2.857      .067      -1.176
 20      1.040      .050        .409

We would like to test whether case 13, which has the largest absolute studentized
deleted residual, is an outlier resulting from a change in the model. We shall use the
Bonferroni simultaneous test procedure with a family significance level of $\alpha = .10$. We
therefore require:

    $t(1 - \alpha/2n;\, n - p - 1) = t(.9975; 16) = 3.252$

Since $|t_{13}| = 1.825 < 3.252$, we conclude that case 13 is not an outlier. Still, we might
wish to investigate whether case 13 and perhaps a few other outlying cases are influential
in determining the fitted regression function because the Bonferroni procedure provides a
very conservative test for the presence of an outlier.
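The arithmetic of this example is easy to script. The sketch below recomputes the studentized deleted residual for case 1 from the quantities quoted in the text ($e_1 = -1.683$, $h_{11} = .201$, $SSE = 109.95$, $n = 20$, $p = 3$) and then applies the Bonferroni decision rule to case 13. The critical value 3.252 is simply copied from the text rather than computed, since the Python standard library has no $t$ quantile function; a statistics package would be needed to obtain it directly.

```python
import math

# Sketch: body fat example, studentized deleted residual for case 1 and the
# Bonferroni outlier test for case 13. All inputs are values quoted in the text.

n, p = 20, 3
alpha = 0.10                 # family significance level
sse = 109.95                 # error sum of squares for the two-predictor model

e1, h11 = -1.683, 0.201      # residual and leverage for case 1
t1 = e1 * math.sqrt((n - p - 1) / (sse * (1 - h11) - e1 ** 2))

upper_prob = 1 - alpha / (2 * n)   # percentile required for the two-sided test
df = n - p - 1                     # degrees of freedom of each t_i
critical = 3.252                   # t(.9975; 16), copied from the text (assumption)

t13 = -1.825                       # largest absolute studentized deleted residual
is_outlier = abs(t13) > critical

print(round(t1, 3))                # close to the tabled value -.730
print(round(upper_prob, 4), df)
print(is_outlier)
```

Because the family contains $n$ tests, the required percentile moves further into the tail as $n$ grows, which is why the procedure is conservative.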


10.3 Identifying Outlying X Observations — Hat Matrix
Leverage Values

Use of Hat Matrix for Identifying Outlying X Observations 

The hat matrix, as we saw, plays an important role in determining the magnitude of a 
studentized deleted residual and therefore in identifying outlying Y observations. The hat 
matrix also is helpful in directly identifying outlying X observations. In particular, the 
diagonal elements of the hat matrix are a useful indicator in a multivariable setting of 
whether or not a case is outlying with respect to its X values. 

The diagonal elements $h_{ii}$ of the hat matrix have some useful properties. In particular,
their values are always between 0 and 1 and their sum is $p$:

    $0 \le h_{ii} \le 1 \qquad \sum_{i=1}^{n} h_{ii} = p$    (10.27)

where $p$ is the number of regression parameters in the regression function including the
intercept term. In addition, it can be shown that $h_{ii}$ is a measure of the distance between
the $X$ values for the $i$th case and the means of the $X$ values for all $n$ cases. Thus, a large value
$h_{ii}$ indicates that the $i$th case is distant from the center of all $X$ observations. The diagonal
element $h_{ii}$ in this context is called the leverage (in terms of the $X$ values) of the $i$th case.

Figure 10.6 illustrates the role of the leverage values $h_{ii}$ as distance measures for our
earlier example in Table 10.2. Figure 10.6 shows a scatter plot of $X_2$ against $X_1$ for the
four cases, and the center of the four cases located at $(\bar{X}_1, \bar{X}_2)$. This center is called
the centroid. Here, the centroid is $(\bar{X}_1 = 14.0, \bar{X}_2 = 23.5)$. In addition, Figure 10.6 shows
the leverage value for each case. Note that cases 1 and 3, which are closest to the centroid,
have the smallest leverage values, while cases 2 and 4, which are farthest from the center,
have the largest leverage values. Note also that the four leverage values sum to $p = 3$.
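The distance interpretation can be made explicit. For a model with an intercept, a standard identity (not derived in the text, so treated here as an assumption of the sketch) writes the leverage as $h_{ii} = 1/n + (\mathbf{x}_i - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_i - \bar{\mathbf{x}})$, where $\mathbf{S}$ is the matrix of centered sums of squares and cross products of the predictors. The sketch below applies it to the four cases of Table 10.2, using the closed-form inverse of the $2 \times 2$ matrix $\mathbf{S}$.

```python
# Sketch: leverage as a squared distance from the centroid, Table 10.2 data.
# Uses h_ii = 1/n + d' S^{-1} d with d = (X_i1 - mean1, X_i2 - mean2).

x1 = [14, 19, 12, 11]
x2 = [25, 32, 22, 15]
n = len(x1)

m1, m2 = sum(x1) / n, sum(x2) / n                      # centroid
s11 = sum((a - m1) ** 2 for a in x1)                   # centered sums of squares
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) # centered cross product
det = s11 * s22 - s12 ** 2

leverages = []
for a, b in zip(x1, x2):
    d1, d2 = a - m1, b - m2
    # d' S^{-1} d written out for a 2 x 2 matrix S.
    dist2 = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    leverages.append(1.0 / n + dist2)

print((m1, m2))
print([round(h, 4) for h in leverages])
print(round(sum(leverages), 6))    # the leverages sum to p = 3
```

The distance is measured relative to the joint spread of the predictors, which is why a case can have high leverage without being extreme on either predictor alone.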


FIGURE 10.6 Illustration of Leverage Values as Distance Measures—Table 10.2 Example.


If the $i$th case is outlying in terms of its $X$ observations and therefore has a large leverage
value $h_{ii}$, it exercises substantial leverage in determining the fitted value $\hat{Y}_i$. This is so for
the following reasons:

1. The fitted value $\hat{Y}_i$ is a linear combination of the observed $Y$ values, as shown
by (10.11), and $h_{ii}$ is the weight of observation $Y_i$ in determining this fitted value. Thus, the
larger is $h_{ii}$, the more important is $Y_i$ in determining $\hat{Y}_i$. Remember that $h_{ii}$ is a function
only of the $X$ values, so $h_{ii}$ measures the role of the $X$ values in determining how important
$Y_i$ is in affecting the fitted value $\hat{Y}_i$.

2. The larger is $h_{ii}$, the smaller is the variance of the residual $e_i$, as we noted earlier
from (10.14). Hence, the larger is $h_{ii}$, the closer the fitted value $\hat{Y}_i$ will tend to be to the
observed value $Y_i$. In the extreme case where $h_{ii} = 1$, the variance $\sigma^2\{e_i\}$ equals 0, so the
fitted value $\hat{Y}_i$ is then forced to equal the observed value $Y_i$.

A leverage value $h_{ii}$ is usually considered to be large if it is more than twice as large as
the mean leverage value, denoted by $\bar{h}$, which according to (10.27) is:

    $\bar{h} = \dfrac{\sum_{i=1}^{n} h_{ii}}{n} = \dfrac{p}{n}$    (10.28)

Hence, leverage values greater than $2p/n$ are considered by this rule to indicate outlying
cases with regard to their $X$ values. Another suggested guideline is that $h_{ii}$ values exceeding
.5 indicate very high leverage, whereas those between .2 and .5 indicate moderate leverage.
Additional evidence of an outlying case is the existence of a gap between the leverage values
for most of the cases and the unusually large leverage value(s).

The rules just mentioned for identifying cases that are outlying with respect to their 
X values are intended for data sets that are reasonably large, relative to the number of 
parameters in the regression function. They are not applicable, for instance, to the simple 
example in Table 10.2 where there are n = 4 cases and p = 3 parameters in the regression 
function. Here, the mean leverage value is 3/4 = .75, and one cannot obtain a leverage 
value twice as large as the mean value since leverage values cannot exceed 1.0. 


Example 


We continue with the body fat example of Table 7.1. We again use only the two predictor
variables, triceps skinfold thickness ($X_1$) and thigh circumference ($X_2$), so that the results
using the hat matrix can be compared to simple graphic plots. Figure 10.7 contains a scatter
plot of $X_2$ against $X_1$, where the data points are identified by their case number. We note from
Figure 10.7 that cases 15 and 3 appear to be outlying ones with respect to the pattern of the
$X$ values. Case 15 is outlying for $X_1$ and at the low end of the range for $X_2$, whereas case 3
is outlying in terms of the pattern of multicollinearity, though it is not outlying for either of
the predictor variables separately. Cases 1 and 5 also appear to be somewhat extreme.

Table 10.3, column 2, contains the leverage values $h_{ii}$ for the body fat example. Note that
the two largest leverage values are $h_{33} = .372$ and $h_{15,15} = .333$. Both exceed the criterion
of twice the mean leverage value, $2p/n = 2(3)/20 = .30$, and both are separated by a
substantial gap from the next largest leverage values, $h_{55} = .248$ and $h_{11} = .201$. Having
identified cases 3 and 15 as outlying in terms of their $X$ values, we shall need to ascertain
how influential these cases are in the fitting of the regression function.

FIGURE 10.7 Scatter Plot of Thigh Circumference against Triceps Skinfold Thickness—Body
Fat Example with Two Predictor Variables.
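The $2p/n$ screening rule used in this example amounts to a one-line comparison. The sketch below applies it to the four leverage values quoted above for the body fat example (cases 3, 15, 5, and 1); the remaining sixteen leverages are smaller and are omitted here, and the dictionary layout is simply our own choice for the illustration.

```python
# Sketch: flag high-leverage cases with the 2p/n rule, body fat example.
# Only the four largest leverage values quoted in the text are included.

n, p = 20, 3
cutoff = 2 * p / n                     # twice the mean leverage p/n

leverages = {3: 0.372, 15: 0.333, 5: 0.248, 1: 0.201}   # case number -> h_ii

flagged = sorted(case for case, h in leverages.items() if h > cutoff)
gap = min(leverages[c] for c in flagged) - max(
    h for c, h in leverages.items() if c not in flagged)

print(cutoff)              # the cutoff 2p/n
print(flagged)             # cases exceeding the cutoff
print(round(gap, 3))       # gap between flagged and unflagged leverages
```

The gap computed at the end corresponds to the additional evidence mentioned in the text: the flagged leverages are well separated from the next largest ones.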


Use of Hat Matrix to Identify Hidden Extrapolation 

We have seen that the hat matrix is useful in the model-building stage for identifying cases
that are outlying with respect to their $X$ values and that, therefore, may be influential in
affecting the fitted model. The hat matrix is also useful after the model has been selected
and fitted for determining whether an inference for a mean response or a new observation
involves a substantial extrapolation beyond the range of the data. When there are only two
predictor variables, it is easy to see from a scatter plot of $X_2$ against $X_1$ whether an inference
for a particular $(X_1, X_2)$ set of values is outlying beyond the range of the data, such as from
Figure 10.7. This simple graphic analysis is no longer available with larger numbers of
predictor variables, where extrapolations may be hidden.

To spot hidden extrapolations, we can utilize the direct leverage calculation in (10.18)
for the new set of $X$ values for which inferences are to be made:

$$h_{\text{new,new}} = \mathbf{X}'_{\text{new}} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}_{\text{new}} \tag{10.29}$$

where $\mathbf{X}_{\text{new}}$ is the vector containing the $X$ values for which an inference about a mean
response or a new observation is to be made, and the $\mathbf{X}$ matrix is the one based on the data
set used for fitting the regression model. If $h_{\text{new,new}}$ is well within the range of leverage
values $h_{ii}$ for the cases in the data set, no extrapolation is involved. On the other hand, if
$h_{\text{new,new}}$ is much larger than the leverage values for the cases in the data set, an extrapolation
is indicated.
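
The leverage check in (10.29) can be sketched in a few lines of Python. The data, the two candidate points, and the helper names `solve` and `leverage` below are hypothetical and chosen only for illustration; they are not the body fat data. The second candidate point has each coordinate inside the observed range, yet the pair lies far from the joint pattern of the predictors, which is the kind of hidden extrapolation described above.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverage(rows, x_new):
    """Return x_new' (X'X)^{-1} x_new as in (10.29); an intercept column is added to X."""
    X = [[1.0] + list(r) for r in rows]
    v = [1.0] + list(x_new)
    p = len(v)
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    z = solve(xtx, v)  # z = (X'X)^{-1} x_new
    return sum(vi * zi for vi, zi in zip(v, z))


# Hypothetical data: two strongly correlated predictors (illustration only).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.9), (6.0, 12.1)]
h = [leverage(data, row) for row in data]   # leverage values h_ii of the cases
h_inside = leverage(data, (3.5, 7.0))       # follows the X1-X2 pattern
h_hidden = leverage(data, (5.0, 3.0))       # each value in range, the pair is not

print("largest h_ii:", round(max(h), 3))
print("h_new,new inside the pattern:", round(h_inside, 3))
print("h_new,new off the pattern:", round(h_hidden, 3))
```

A value of $h_{\text{new,new}}$ far above the largest $h_{ii}$, as for the second point, signals an extrapolation even though neither coordinate is extreme by itself.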


10.4 Identifying Influential Cases — DFFITS, Cook’s Distance, 
and DFBETAS Measures 

After identifying cases that are outlying with respect to their Y values and/or their X 
values, the next step is to ascertain whether or not these outlying cases are influential. We 
shall consider a case to be influential if its exclusion causes major changes in the fitted
regression function. As noted in Figure 10.5, not all outlying cases need be influential. For 
example, case 1 in Figure 10.5 may not affect the fitted regression function to any substantial 
extent. 

We take up three measures of influence that are widely used in practice, each based on
the omission of a single case to measure its influence.





Influence on Single Fitted Value — DFFITS

A useful measure of the influence that case $i$ has on the fitted value $\hat{Y}_i$ is given by:

$$(DFFITS)_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}} \tag{10.30}$$

The letters DF stand for the difference between the fitted value $\hat{Y}_i$ for the $i$th case when all $n$
cases are used in fitting the regression function and the predicted value $\hat{Y}_{i(i)}$ for the $i$th case
obtained when the $i$th case is omitted in fitting the regression function. The denominator
of (10.30) is the estimated standard deviation of $\hat{Y}_i$, but it uses the error mean square when
the $i$th case is omitted in fitting the regression function for estimating the error variance $\sigma^2$.
The denominator provides a standardization so that the value $(DFFITS)_i$ for the $i$th case
represents the number of estimated standard deviations of $\hat{Y}_i$ that the fitted value $\hat{Y}_i$ increases
or decreases with the inclusion of the $i$th case in fitting the regression model.

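It can be shown that the DFFITS values can be computed by using only the results from
fitting the entire data set, as follows:

$$(DFFITS)_i = e_i \left[\frac{n - p - 1}{SSE(1 - h_{ii}) - e_i^2}\right]^{1/2} \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} = t_i \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} \tag{10.30a}$$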


Note from the last expression that the DFFITS value for the ith case is a studentized deleted 
residual, as given in (10.26), increased or decreased by a factor that is a function of the 
leverage value for this case. If case i is an X outlier and has a high leverage value, this 
factor will be greater than 1 and (DFFITS)i will tend to be large absolutely. 

As a guideline for identifying influential cases, we suggest considering a case influential
if the absolute value of DFFITS exceeds 1 for small to medium data sets and $2\sqrt{p/n}$ for
large data sets.
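
As a rough illustration of why (10.30a) is convenient, the following Python sketch computes DFFITS twice for a simple linear regression ($p = 2$): once from the full fit only, and once from definition (10.30) by actually refitting without each case. The data and the function names are hypothetical and serve only to show that the two routes agree.

```python
from math import sqrt


def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def dffits_shortcut(x, y):
    """DFFITS for every case from the full fit only, following (10.30a)."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # leverage values
    sse = sum(ei ** 2 for ei in e)
    out = []
    for ei, hi in zip(e, h):
        t = ei * sqrt((n - p - 1) / (sse * (1 - hi) - ei ** 2))  # studentized deleted residual
        out.append(t * sqrt(hi / (1 - hi)))
    return out


def dffits_by_deletion(x, y):
    """DFFITS from definition (10.30): refit without case i each time."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    out = []
    for i in range(n):
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(xd, yd)
        mse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(xd, yd)) / (n - 1 - p)
        h_ii = 1 / n + (x[i] - xbar) ** 2 / sxx
        out.append(((b0 + b1 * x[i]) - (a0 + a1 * x[i])) / sqrt(mse_del * h_ii))
    return out


# Hypothetical data; the last case has high leverage and does not follow the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 9.0]
shortcut = dffits_shortcut(x, y)
brute = dffits_by_deletion(x, y)
flagged = [i + 1 for i, d in enumerate(shortcut) if abs(d) > 1]  # guideline for small data sets
print("cases with |DFFITS| > 1:", flagged)
```

The shortcut needs one fit, whereas the deletion route needs $n + 1$ fits, which is the practical reason for preferring (10.30a).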


Example 


Table 10.4, column 1, lists the DFFITS values for the body fat example with two predictor
variables. To illustrate the calculations, consider the DFFITS value for case 3, which was
identified as outlying with respect to its $X$ values. From Table 10.3, we know that the
studentized deleted residual for this case is $t_3 = -1.656$ and the leverage value is $h_{33} = .372$.
Hence, using (10.30a) we obtain:

$$(DFFITS)_3 = -1.656 \left(\frac{.372}{1 - .372}\right)^{1/2} = -1.27$$

The only DFFITS value in Table 10.4 that exceeds our guideline for a medium-size
data set is for case 3, where $|(DFFITS)_3| = 1.273$. This value is somewhat larger than our
guideline of 1. However, the value is close enough to 1 that the case may not be influential
enough to require remedial action.
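
The screening step of this example can be reproduced with a short Python sketch. The DFFITS values are those of Table 10.4, column 1; the variable names are our own.

```python
from math import sqrt

# Case 3: studentized deleted residual and leverage value from Table 10.3.
t_3, h_33 = -1.656, 0.372
dffits_3 = t_3 * sqrt(h_33 / (1 - h_33))   # formula (10.30a)

# (DFFITS)_i for cases 1 to 20, Table 10.4, column 1.
dffits = [-.366, .384, -1.273, -.476, .000, -.057, .128, .575, .402, -.364,
          .051, .323, -.851, .636, .189, .084, -.118, -.166, -.315, .094]

# Guideline for small to medium data sets: |DFFITS| > 1.
flagged = [case for case, d in enumerate(dffits, start=1) if abs(d) > 1]
print(round(dffits_3, 2), flagged)
```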


Comment

The estimated variance of $\hat{Y}_i$ used in the denominator of (10.30) is developed from the relation
$\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$ in (10.11). Using (5.46), we obtain:

$$\sigma^2\{\hat{\mathbf{Y}}\} = \mathbf{H}\,\sigma^2\{\mathbf{Y}\}\,\mathbf{H}' = \mathbf{H}(\sigma^2 \mathbf{I})\mathbf{H}'$$

Since $\mathbf{H}$ is a symmetric matrix, so that $\mathbf{H}' = \mathbf{H}$, and it is also idempotent, so that $\mathbf{H}\mathbf{H} = \mathbf{H}$, we obtain:

$$\sigma^2\{\hat{\mathbf{Y}}\} = \sigma^2 \mathbf{H} \tag{10.31}$$

Hence, the variance of $\hat{Y}_i$ is:

$$\sigma^2\{\hat{Y}_i\} = \sigma^2 h_{ii} \tag{10.32}$$

where $h_{ii}$ is the $i$th diagonal element of the hat matrix. The error term variance $\sigma^2$ is estimated
in (10.30) by the error mean square obtained when the $i$th case is omitted in fitting the regression
model. ■


TABLE 10.4 DFFITS, Cook’s Distances, and DFBETAS — Body Fat Example with Two Predictor Variables.

          (1)         (2)       (3)       (4)       (5)
                                        DFBETAS
  i    (DFFITS)_i     D_i       b_0       b_1       b_2
  1      -.366       .046     -.305     -.132      .232
  2       .384       .046      .173      .115     -.143
  3     -1.273       .490     -.847    -1.183     1.067
  4      -.476       .072     -.102     -.294      .196
  5       .000       .000      .000      .000      .000
  6      -.057       .001      .040      .040     -.044
  7       .128       .006     -.078     -.016      .054
  8       .575       .098      .261      .391     -.333
  9       .402       .053     -.151     -.295      .247
 10      -.364       .044      .238      .245     -.269
 11       .051       .001     -.009      .017     -.003
 12       .323       .035     -.131      .023      .070
 13      -.851       .212      .119      .592     -.390
 14       .636       .125      .452      .113     -.298
 15       .189       .013     -.003     -.125      .069
 16       .084       .002      .009      .043     -.025
 17      -.118       .005      .080      .055     -.076
 18      -.166       .010      .132      .075     -.116
 19      -.315       .032     -.130     -.004      .064
 20       .094       .003      .010      .002     -.003


Influence on All Fitted Values — Cook’s Distance

In contrast to the DFFITS measure in (10.30), which considers the influence of the $i$th case
on the fitted value $\hat{Y}_i$ for this case, Cook’s distance measure considers the influence of
the $i$th case on all $n$ fitted values. Cook’s distance measure, denoted by $D_i$, is an aggregate
influence measure, showing the effect of the $i$th case on all $n$ fitted values:

$$D_i = \frac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p\,MSE} \tag{10.33}$$

Note that the numerator involves similar differences as in the DFFITS measure, but here
each of the $n$ fitted values $\hat{Y}_j$ is compared with the corresponding fitted value $\hat{Y}_{j(i)}$ when the
$i$th case is deleted in fitting the regression model. These differences are then squared and
summed, so that the aggregate influence of the $i$th case is measured without regard to the
signs of the effects. Finally, the denominator serves as a standardizing measure. In matrix
terms, Cook’s distance measure can be expressed as follows:

$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})'(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})}{p\,MSE} \tag{10.33a}$$


Here, $\hat{\mathbf{Y}}$ as usual is the vector of the fitted values when all $n$ cases are used for the regression
fit and $\hat{\mathbf{Y}}_{(i)}$ is the vector of the fitted values when the $i$th case is deleted.

For interpreting Cook’s distance measure, it has been found useful to relate $D_i$ to the
$F(p, n - p)$ distribution and ascertain the corresponding percentile value. If the percentile
value is less than about 10 or 20 percent, the $i$th case has little apparent influence on the fitted
values. If, on the other hand, the percentile value is near 50 percent or more, the fitted values
obtained with and without the $i$th case should be considered to differ substantially, implying
that the $i$th case has a major influence on the fit of the regression function.

Fortunately, Cook’s distance measure $D_i$ can be calculated without fitting a new
regression function each time a different case is deleted. An algebraically equivalent
expression is:

$$D_i = \frac{e_i^2}{p\,MSE} \left[\frac{h_{ii}}{(1 - h_{ii})^2}\right] \tag{10.33b}$$

Note from (10.33b) that $D_i$ depends on two factors: (1) the size of the residual $e_i$ and (2)
the leverage value $h_{ii}$. The larger either $e_i$ or $h_{ii}$ is, the larger $D_i$ is. Thus, the $i$th case can
be influential: (1) by having a large residual $e_i$ and only a moderate leverage value $h_{ii}$, or
(2) by having a large leverage value $h_{ii}$ with only a moderately sized residual $e_i$, or (3) by
having both a large residual $e_i$ and a large leverage value $h_{ii}$.
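
The equivalence of (10.33) and (10.33b) can be checked numerically. The sketch below does so for a simple linear regression ($p = 2$) on hypothetical data; the function names are ours, and the brute-force version refits the line once for every deleted case.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def cooks_shortcut(x, y):
    """Cook's distance from residuals and leverage values only, as in (10.33b)."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    mse = sum(ei ** 2 for ei in e) / (n - p)
    return [ei ** 2 / (p * mse) * hi / (1 - hi) ** 2 for ei, hi in zip(e, h)]


def cooks_by_deletion(x, y):
    """Cook's distance from definition (10.33): compare all n fitted values."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    out = []
    for i in range(n):
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a0 + a1 * xi)) ** 2 for fi, xi in zip(fitted, x))
        out.append(shift / (p * mse))
    return out


# Hypothetical data; the last case combines high leverage with a sizable residual.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 9.0]
d_short = cooks_shortcut(x, y)
d_brute = cooks_by_deletion(x, y)
most_influential = d_short.index(max(d_short)) + 1
print("most influential case:", most_influential)
```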


Example


For the body fat example with two predictor variables, Table 10.4, column 2, presents the
$D_i$ values. To illustrate the calculations, we consider again case 3, which is outlying with
regard to its $X$ values. We know from Table 10.3 that $e_3 = -3.176$ and $h_{33} = .372$. Further,
$MSE = 6.47$ according to Table 7.2c and $p = 3$ for the model with two predictor variables.
Hence, we obtain:

$$D_3 = \frac{(-3.176)^2}{3(6.47)} \left[\frac{.372}{(1 - .372)^2}\right] = .490$$


We note from Table 10.4, column 2, that case 3 clearly has the largest $D_i$ value, with the
next largest distance measure $D_{13} = .212$ being substantially smaller. Figure 10.8 presents
the information provided by Cook’s distance measure about the influence of each case in
two different plots. Shown in Figure 10.8a is a proportional influence plot of the residuals $e_i$
against the corresponding fitted values $\hat{Y}_i$, the size of the plotted points being proportional
to Cook’s distance measure $D_i$. Figure 10.8b presents the information about the Cook’s
distance measures in the form of an index influence plot, where Cook’s distance measure
$D_i$ is plotted against the corresponding case index $i$. Both plots in Figure 10.8 clearly show
that one case stands out as most influential (case 3) and that all the other cases are much less
influential. The proportional influence plot in Figure 10.8a shows that the residual for the
most influential case is large negative, but does not identify the case. The index influence
plot in Figure 10.8b, on the other hand, identifies the most influential case as case 3 but
does not provide any information about the magnitude of the residual for this case.




To assess the magnitude of the influence of case 3 ($D_3 = .490$), we refer to the
corresponding $F$ distribution, namely, $F(p, n - p) = F(3, 17)$. We find that .490 is the 30.6th
percentile of this distribution. Hence, it appears that case 3 does influence the regression fit,
but the extent of the influence may not be large enough to call for consideration of remedial
measures.
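
For readers who wish to verify the percentile, the following Python sketch recomputes $D_3$ from (10.33b) and then evaluates the $F(3, 17)$ cumulative probability by numerical integration of the $F$ density. The helper functions `f_pdf` and `f_cdf` are our own minimal implementation using Simpson's rule, written under the assumption that the numerator degrees of freedom exceed 1; a statistical package would normally be used instead.

```python
from math import exp, lgamma, log


def f_pdf(x, d1, d2):
    """Density of the F(d1, d2) distribution at x > 0."""
    log_c = (lgamma((d1 + d2) / 2) - lgamma(d1 / 2) - lgamma(d2 / 2)
             + (d1 / 2) * log(d1 / d2))
    return exp(log_c + (d1 / 2 - 1) * log(x) - ((d1 + d2) / 2) * log(1 + d1 * x / d2))


def f_cdf(x, d1, d2, steps=2000):
    """P{F(d1, d2) <= x} by Simpson's rule after substituting x = u**2 (assumes d1 > 1)."""
    upper = x ** 0.5
    width = upper / steps

    def g(u):
        # Integrand after the substitution; it vanishes at u = 0 when d1 > 1.
        return 0.0 if u == 0 else 2 * u * f_pdf(u * u, d1, d2)

    total = g(0.0) + g(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * g(k * width)
    return total * width / 3


# Values for case 3 taken from the text: e_3, h_33, MSE, p, and n.
e_3, h_33, mse, p, n = -3.176, 0.372, 6.47, 3, 20
d_3 = e_3 ** 2 / (p * mse) * h_33 / (1 - h_33) ** 2   # formula (10.33b)
percentile = f_cdf(d_3, p, n - p)
print(round(d_3, 3), round(100 * percentile, 1))
```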

Influence on the Regression Coefficients — DFBETAS 

A measure of the influence of the $i$th case on each regression coefficient $b_k$ ($k = 0, 1, \ldots,
p - 1$) is the difference between the estimated regression coefficient $b_k$ based on all $n$ cases
and the regression coefficient obtained when the $i$th case is omitted, to be denoted by $b_{k(i)}$.
When this difference is divided by an estimate of the standard deviation of $b_k$, we obtain
the measure DFBETAS:

$$(DFBETAS)_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}\, c_{kk}}} \qquad k = 0, 1, \ldots, p - 1 \tag{10.34}$$

where $c_{kk}$ is the $k$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. Recall from (6.46) that the variance-
covariance matrix of the regression coefficients is given by $\sigma^2\{\mathbf{b}\} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. Hence the
variance of $b_k$ is:

$$\sigma^2\{b_k\} = \sigma^2 c_{kk} \tag{10.35}$$

The error term variance $\sigma^2$ here is estimated by $MSE_{(i)}$, the error mean square obtained
when the $i$th case is deleted in fitting the regression model.

The DFBETAS value by its sign indicates whether inclusion of a case leads to an increase
or a decrease in the estimated regression coefficient, and its absolute magnitude shows
the size of the difference relative to the estimated standard deviation of the regression
coefficient. A large absolute value of $(DFBETAS)_{k(i)}$ is indicative of a large impact of the



FIGURE 10.8 Proportional Influence Plot (Points Proportional in Size to Cook’s Distance Measure) and Index
Influence Plot — Body Fat Example with Two Predictor Variables.

(a) Proportional Influence Plot (residual against fitted value, YHAT) (b) Index Influence Plot (Cook’s distance $D$ against case index number)



$i$th case on the $k$th regression coefficient. As a guideline for identifying influential cases,
we recommend considering a case influential if the absolute value of DFBETAS exceeds 1
for small to medium data sets and $2/\sqrt{n}$ for large data sets.
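
A minimal Python sketch of (10.34) for a simple linear regression follows. With one predictor, the diagonal elements of $(\mathbf{X}'\mathbf{X})^{-1}$ have the closed forms $c_{00} = \sum X_i^2 / (n S_{XX})$ and $c_{11} = 1/S_{XX}$, so no matrix inversion is needed. The data and function names are hypothetical.

```python
from math import sqrt


def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def dfbetas(x, y):
    """Return one (DFBETAS for b0, DFBETAS for b1) pair per case, following (10.34)."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    c00 = sum(xi ** 2 for xi in x) / (n * sxx)   # diagonal elements of (X'X)^{-1}
    c11 = 1 / sxx
    rows = []
    for i in range(n):
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(xd, yd)                # coefficients without case i
        mse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(xd, yd)) / (n - 1 - p)
        rows.append(((b0 - a0) / sqrt(mse_del * c00), (b1 - a1) / sqrt(mse_del * c11)))
    return rows


# Hypothetical data; the last case pulls the fitted slope downward.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 9.0]
values = dfbetas(x, y)
flagged = [i + 1 for i, row in enumerate(values) if max(abs(v) for v in row) > 1]
print("cases with some |DFBETAS| > 1:", flagged)
```

The negative sign of the slope entry for the last case reflects that including this case lowers the estimated slope.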


Example


For the body fat example with two predictor variables, Table 10.4 lists the DFBETAS values
in columns 3, 4, and 5. Note that case 3, which is outlying with respect to its $X$ values,
is the only case that exceeds our guideline of 1 for medium-size data sets for both $b_1$ and
$b_2$. Thus, case 3 is again tagged as potentially influential. Again, however, the DFBETAS
values do not exceed 1 by very much so that case 3 may not be so influential as to require
remedial action.


Comment

Cook’s distance measure of the aggregate influence of a case on the $n$ fitted values, which was defined
in (10.33), is algebraically equivalent to a measure of the aggregate influence of a case on the $p$
regression coefficients. In fact, Cook’s distance measure was originally derived from the concept of
a confidence region for all $p$ regression coefficients $\beta_k$ ($k = 0, 1, \ldots, p - 1$) simultaneously. It can
be shown that the boundary of this joint confidence region for the normal error multiple regression
model (6.19) is given by:

$$\frac{(\mathbf{b} - \boldsymbol{\beta})'\mathbf{X}'\mathbf{X}(\mathbf{b} - \boldsymbol{\beta})}{p\,MSE} = F(1 - \alpha; p, n - p) \tag{10.36}$$

Cook’s distance measure $D_i$ uses the same structure for measuring the combined impact of the $i$th
case on the differences in the estimated regression coefficients:

$$D_i = \frac{(\mathbf{b} - \mathbf{b}_{(i)})'\mathbf{X}'\mathbf{X}(\mathbf{b} - \mathbf{b}_{(i)})}{p\,MSE} \tag{10.37}$$

where $\mathbf{b}_{(i)}$ is the vector of the estimated regression coefficients obtained when the $i$th case is omitted
and $\mathbf{b}$, as usual, is the vector when all $n$ cases are used. The expressions for Cook’s distance measure
in (10.33a) and (10.37) are algebraically identical. ■


Influence on Inferences 

To round out the determination of influential cases, it is usually a good idea to examine in a 
direct fashion the inferences from the fitted regression model that would be made with and 
without the case(s) of concern. If the inferences are not essentially changed, there is little 
need to think of remedial actions for the cases diagnosed as influential. On the other hand, 
serious changes in the inferences drawn from the fitted model when a case is omitted will
require consideration of remedial measures.


Example 


In the body fat example with two predictor variables, cases 3 and 15 were identified as 
outlying X observations and cases 8 and 13 as outlying Y observations. All three influence
measures (DFFITS, Cook’s distance, and DFBETAS) identified only case 3 as influential, 
and, indeed, suggested that its influence may be of marginal importance so that remedial 
measures might not be required. 

The analyst in the body fat example was primarily interested in the fit of the regression 
model because the model was intended to be used for making predictions within the range 
of the observations on the predictor variables in the data set. Hence, the analyst considered 



the fitted regression functions with and without case 3:

With case 3: $\hat{Y} = -19.174 + .2224X_1 + .6594X_2$
Without case 3: $\hat{Y} = -12.428 + .5641X_1 + .3635X_2$


Because of the high multicollinearity between $X_1$ and $X_2$, the analyst was not surprised
by the shifts in the magnitudes of $b_1$ and $b_2$ when case 3 is omitted. Remember that the
estimated standard deviations of the coefficients, given in Table 7.2c, are very large and
that a single case can change the estimated coefficients substantially when the predictor
variables are highly correlated.

To examine the effect of case 3 on inferences to be made from the fitted regression
function in the range of the $X$ observations in a direct fashion, the analyst calculated for
each of the 20 cases the relative difference between the fitted value $\hat{Y}_i$ based on all 20 cases
and the fitted value $\hat{Y}_{i(3)}$ obtained when case 3 is omitted. The measure of interest was the
average absolute percent difference:

$$\sum_{i=1}^{n} \left|\frac{\hat{Y}_{i(3)} - \hat{Y}_i}{\hat{Y}_i}\right| \frac{100}{n}$$


This mean difference is 3.1 percent; further, 17 of the 20 differences are less than 5 percent 
(calculations not shown). On the basis of this direct evidence about the effect of case 3 on 
the inferences to be made, the analyst was satisfied that case 3 does not exercise undue 
influence so that no remedial action is required for handling this case. 
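
The average absolute percent difference is simple to compute once both fitted functions are available. The Python sketch below uses the two fitted functions reported above; the $(X_1, X_2)$ points are hypothetical values inside the range discussed in the example, not the Table 7.1 data, so the result is not meant to reproduce the 3.1 percent figure.

```python
def fitted(coefs, x1, x2):
    """Fitted value b0 + b1*X1 + b2*X2 for one case."""
    b0, b1, b2 = coefs
    return b0 + b1 * x1 + b2 * x2


def mean_abs_pct_diff(full, deleted, points):
    """Average of |(Yhat_deleted - Yhat_full) / Yhat_full| * 100 over the given points."""
    total = 0.0
    for x1, x2 in points:
        y_full = fitted(full, x1, x2)
        y_del = fitted(deleted, x1, x2)
        total += abs((y_del - y_full) / y_full) * 100
    return total / len(points)


with_case_3 = (-19.174, 0.2224, 0.6594)      # fitted function with case 3 (from the text)
without_case_3 = (-12.428, 0.5641, 0.3635)   # fitted function without case 3 (from the text)

# Hypothetical (X1, X2) pairs for illustration only; NOT the Table 7.1 observations.
points = [(20.0, 44.0), (25.0, 50.0), (30.0, 55.0)]
avg_diff = mean_abs_pct_diff(with_case_3, without_case_3, points)
print(round(avg_diff, 1), "percent")
```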

Some Final Comments 

Analysis of outlying and influential cases is a necessary component of good regression 
analysis. However, it is neither automatic nor foolproof and requires good judgment by the 
analyst. The methods described often work well, but at times are ineffective. For example, 
if two influential outlying cases are nearly coincident, as depicted in Figure 10.5 by cases 3 
and 4, an analysis that deletes one case at a time and estimates the change in fit will result in 
virtually no change for these two outlying cases. The reason is that the retained outlying case 
will mask the effect of the deleted outlying case. Extensions of the single-case diagnostic 
procedures described here have been developed that involve deleting two or more cases 
at a time. However, the computational requirements for these extensions are much more 
demanding than for the single-case diagnostics. Reference 10.4 describes some of these 
extensions. 

Remedial measures for outlying cases that are determined to be highly influential by the 
diagnostic procedures will be discussed in the next chapter. 


10.5 Multicollinearity Diagnostics — Variance Inflation Factor

When we discussed multicollinearity in Chapter 7, we noted some key problems that
typically arise when the predictor variables being considered for the regression model are highly
correlated among themselves:

1. Adding or deleting a predictor variable changes the regression coefficients. 





2. The extra sum of squares associated with a predictor variable varies, depending upon
which other predictor variables are already included in the model.

3. The estimated standard deviations of the regression coefficients become large when 
the predictor variables in the regression model are highly correlated with each other. 

4. The estimated regression coefficients individually may not be statistically significant 
even though a definite statistical relation exists between the response variable and the set 
of predictor variables. 


These problems can also arise without substantial multicollinearity being present, but only 
under unusual circumstances not likely to be found in practice. 

We first consider some informal diagnostics for multicollinearity and then a highly useful 
formal diagnostic, the variance inflation factor. 

Informal Diagnostics

Indications of the presence of serious multicollinearity are given by the following informal
diagnostics:

1. Large changes in the estimated regression coefficients when a predictor variable is added
or deleted, or when an observation is altered or deleted.

2. Nonsignificant results in individual tests on the regression coefficients for important
predictor variables.

3. Estimated regression coefficients with an algebraic sign that is the opposite of that
expected from theoretical considerations or prior experience.

4. Large coefficients of simple correlation between pairs of predictor variables in the
correlation matrix $\mathbf{r}_{XX}$.

5. Wide confidence intervals for the regression coefficients representing important predictor
variables.


We consider again the body fat example of Table 7.1, this time with all three predictor
variables: triceps skinfold thickness ($X_1$), thigh circumference ($X_2$), and midarm
circumference ($X_3$). We noted in Chapter 7 that the predictor variables triceps skinfold thickness
and thigh circumference are highly correlated with each other. We also noted large changes
in the estimated regression coefficients and their estimated standard deviations when a
variable was added, nonsignificant results in individual tests on anticipated important variables,
and an estimated negative coefficient when a positive coefficient was expected. These are all
informal indications that suggest serious multicollinearity among the predictor variables.


Comment

The informal methods just described have important limitations. They do not provide quantitative
measurements of the impact of multicollinearity and they may not identify the nature of the
multicollinearity. For instance, if predictor variables $X_1$, $X_2$, and $X_3$ have low pairwise correlations, then
the examination of simple correlation coefficients may not disclose the existence of relations among
groups of predictor variables, such as a high correlation between $X_1$ and a linear combination of $X_2$
and $X_3$.

Another limitation of the informal diagnostic methods is that sometimes the observed behavior
may occur without multicollinearity being present. ■





Variance Inflation Factor 

A formal method of detecting the presence of multicollinearity that is widely accepted is
the use of variance inflation factors. These factors measure how much the variances of the
estimated regression coefficients are inflated as compared to when the predictor variables
are not linearly related.

To understand the significance of variance inflation factors, we begin with the precision
of least squares estimated regression coefficients, which is measured by their variances. We
know from (6.46) that the variance-covariance matrix of the estimated regression
coefficients is:

$$\sigma^2\{\mathbf{b}\} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \tag{10.38}$$

For purposes of measuring the impact of multicollinearity, it is useful to work with the
standardized regression model (7.45), which is obtained by transforming the variables by
means of the correlation transformation (7.44). When the standardized regression model is
fitted, the estimated regression coefficients $b_k^*$ are standardized coefficients that are related
to the estimated regression coefficients for the untransformed variables according to (7.53).
The variance-covariance matrix of the estimated standardized regression coefficients is
obtained from (10.38) by using the result in (7.50), which states that the $\mathbf{X}'\mathbf{X}$ matrix for the
transformed variables is the correlation matrix of the $X$ variables $\mathbf{r}_{XX}$. Hence, we obtain:

$$\sigma^2\{\mathbf{b}^*\} = (\sigma^*)^2\, \mathbf{r}_{XX}^{-1} \tag{10.39}$$

where $\mathbf{r}_{XX}$ is the matrix of the pairwise simple correlation coefficients among the $X$
variables, as defined in (7.47), and $(\sigma^*)^2$ is the error term variance for the transformed model.

Note from (10.39) that the variance of $b_k^*$ ($k = 1, \ldots, p - 1$) is equal to the following,
letting $(VIF)_k$ denote the $k$th diagonal element of the matrix $\mathbf{r}_{XX}^{-1}$:

$$\sigma^2\{b_k^*\} = (\sigma^*)^2 (VIF)_k \tag{10.40}$$

The diagonal element $(VIF)_k$ is called the variance inflation factor (VIF) for $b_k^*$. It can be
shown that this variance inflation factor is equal to:

$$(VIF)_k = (1 - R_k^2)^{-1} \qquad k = 1, 2, \ldots, p - 1 \tag{10.41}$$

where $R_k^2$ is the coefficient of multiple determination when $X_k$ is regressed on the $p - 2$
other $X$ variables in the model. Hence, we have:

$$\sigma^2\{b_k^*\} = \frac{(\sigma^*)^2}{1 - R_k^2} \tag{10.42}$$

We presented in (7.65) the special results for $\sigma^2\{b_k^*\}$ when $p - 1 = 2$, for which $R_k^2 = r_{12}^2$,
the coefficient of simple determination between $X_1$ and $X_2$.

The variance inflation factor $(VIF)_k$ is equal to 1 when $R_k^2 = 0$, i.e., when $X_k$ is not linearly
related to the other $X$ variables. When $R_k^2 \neq 0$, then $(VIF)_k$ is greater than 1, indicating an
inflated variance for $b_k^*$ as a result of the intercorrelations among the $X$ variables. When $X_k$
has a perfect linear association with the other $X$ variables in the model so that $R_k^2 = 1$, then
$(VIF)_k$ and $\sigma^2\{b_k^*\}$ are unbounded.
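
The definition of $(VIF)_k$ as the $k$th diagonal element of $\mathbf{r}_{XX}^{-1}$ translates directly into code. The following Python sketch builds the correlation matrix of the predictors and inverts it with a small Gauss-Jordan routine. The data and the helper names (`invert`, `correlation_matrix`, `vif`) are hypothetical; the third predictor is constructed to be nearly the sum of the first two, so all three factors are large even though the pairwise correlations are at most moderate.

```python
from math import sqrt


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def correlation_matrix(columns):
    """Matrix of pairwise simple correlation coefficients among the predictor columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sqrt(sum(v * v for v in c)) for c in centered]
    return [[sum(u * w for u, w in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]


def vif(columns):
    """Variance inflation factors: diagonal elements of the inverse correlation matrix."""
    r_inv = invert(correlation_matrix(columns))
    return [r_inv[k][k] for k in range(len(columns))]


# Hypothetical predictors: x3 is nearly x1 + x2 (illustration only).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [5.0, 1.0, 7.0, 2.0, 8.0, 3.0, 6.0, 4.0]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, -0.05]
x3 = [a + b + d for a, b, d in zip(x1, x2, noise)]

factors = vif([x1, x2, x3])
pair_only = vif([x1, x2])   # with two predictors, VIF = 1 / (1 - r_12^2)
print([round(f, 1) for f in factors], round(pair_only[0], 4))
```

With only `x1` and `x2` in the model the factors are close to 1, which agrees with (10.41) because $r_{12}$ is small for these hypothetical data.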




Diagnostic Uses. The largest VIF value among all $X$ variables is often used as an indicator
of the severity of multicollinearity. A maximum VIF value in excess of 10 is frequently taken
as an indication that multicollinearity may be unduly influencing the least squares estimates.

The mean of the VIF values also provides information about the severity of the
multicollinearity in terms of how far the estimated standardized regression coefficients $b_k^*$ are
from the true values $\beta_k^*$. It can be shown that the expected value of the sum of these squared
errors $(b_k^* - \beta_k^*)^2$ is given by:

$$E\left\{\sum_{k=1}^{p-1} (b_k^* - \beta_k^*)^2\right\} = (\sigma^*)^2 \sum_{k=1}^{p-1} (VIF)_k \tag{10.43}$$

Thus, large VIF values result, on the average, in larger differences between the estimated
and true standardized regression coefficients.

When no $X$ variable is linearly related to the others in the regression model, $R_k^2 \equiv 0$;
hence, $(VIF)_k \equiv 1$, their sum is $p - 1$, and the expected value of the sum of the squared
errors is:

$$E\left\{\sum_{k=1}^{p-1} (b_k^* - \beta_k^*)^2\right\} = (\sigma^*)^2 (p - 1) \qquad \text{when } (VIF)_k \equiv 1 \tag{10.43a}$$

A ratio of the results in (10.43) and (10.43a) provides useful information about the effect
of multicollinearity on the sum of the squared errors:

$$\frac{(\sigma^*)^2 \sum (VIF)_k}{(\sigma^*)^2 (p - 1)} = \frac{\sum (VIF)_k}{p - 1}$$

Note that this ratio is simply the mean of the VIF values, to be denoted by $(\overline{VIF})$:

$$(\overline{VIF}) = \frac{\sum_{k=1}^{p-1} (VIF)_k}{p - 1} \tag{10.44}$$

Mean VIF values considerably larger than 1 are indicative of serious multicollinearity
problems.


Example


Table 10.5 contains the estimated standardized regression coefficients and the VIF values for
the body fat example with three predictor variables (calculations not shown). The maximum
of the VIF values is 708.84 and their mean value is $(\overline{VIF}) = 459.26$. Thus, the expected sum
of the squared errors in the least squares standardized regression coefficients is nearly 460
times as large as it would be if the $X$ variables were uncorrelated. In addition, all three VIF
values greatly exceed 10, which again indicates that serious multicollinearity problems exist.


TABLE 10.5 Variance Inflation Factors — Body Fat Example with Three Predictor Variables.

Variable      b*_k       (VIF)_k
X_1          4.2637      708.84
X_2         -2.9287      564.34
X_3         -1.5614      104.61

Maximum (VIF)_k = 708.84     Mean VIF = 459.26



It is interesting to note that $(VIF)_3 = 105$ despite the fact that both $r_{13}^2$ and $r_{23}^2$ (see
Figure 7.3b) are not large. Here is an instance where $X_3$ is strongly related to $X_1$ and $X_2$
together ($R_3^2 = .990$), even though the pairwise coefficients of simple determination are not
large. Examination of the pairwise correlations does not disclose this multicollinearity.
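
The summary figures quoted for the body fat example follow from (10.41) and (10.44) by simple arithmetic, as this short Python sketch shows. The three VIF values are those reported in Table 10.5; the dictionary names are our own.

```python
# Variance inflation factors for the body fat example (Table 10.5).
vif = {"X1": 708.84, "X2": 564.34, "X3": 104.61}

mean_vif = sum(vif.values()) / len(vif)                    # formula (10.44)
r_squared = {name: 1 - 1 / v for name, v in vif.items()}   # solve (10.41) for R_k^2
tolerance = {name: 1 / v for name, v in vif.items()}       # reciprocal of VIF, 1 - R_k^2

print(round(mean_vif, 2))
print({name: round(r2, 3) for name, r2 in r_squared.items()})
```

Solving (10.41) for $R_3^2$ recovers the value of about .990 noted above for the regression of $X_3$ on $X_1$ and $X_2$.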


Comments 

1. Some computer regression programs use the reciprocal of the variance inflation factor to detect
instances where an $X$ variable should not be allowed into the fitted regression model because of
excessively high interdependence between this variable and the other $X$ variables in the model. Tolerance
limits for $1/(VIF)_k = 1 - R_k^2$ frequently used are .01, .001, or .0001, below which the variable is not
entered into the model.


2. A limitation of variance inflation factors for detecting multicollinearities is that they cannot
distinguish between several simultaneous multicollinearities.


3. A number of other formal methods for detecting multicollinearity have been proposed. These
are more complex than variance inflation factors and are discussed in specialized texts such as
References 10.5 and 10.6. ■


10.6 Surgical Unit Example — Continued

In Chapter 9 we developed a regression model for the surgical unit example (data in
Table 9.1). Recall that validation studies in Section 9.6 led to the selection of model (9.21),
the model containing variables $X_1$, $X_2$, $X_3$, and $X_8$. We will now utilize this regression model
to demonstrate a more in-depth study of curvature, interaction effects, multicollinearity, and
influential cases using residuals and other diagnostics.

To examine interaction effects further, a regression model containing first-order terms
in $X_1$, $X_2$, $X_3$, and $X_8$ was fitted and added-variable plots for the six two-factor interaction
terms, $X_1X_2$, $X_1X_3$, $X_1X_8$, $X_2X_3$, $X_2X_8$, and $X_3X_8$, were examined. These plots (not
shown) did not suggest that any strong two-variable interactions are present and need to be
included in the model. The absence of any strong interactions was also noted by fitting a
regression model containing $X_1$, $X_2$, $X_3$, and $X_8$ in first-order terms and all two-variable
interaction terms. The P-value of the formal F test statistic (7.19) for dropping all of the
interaction terms from the model containing both the first-order effects and the interaction
effects is .35, indicating that interaction effects are not present.

Figure 10.9 contains some of the additional diagnostic plots that were generated to check
on the adequacy of the first-order model:


$$Y_i' = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_8 X_{i8} + \varepsilon_i \tag{10.45}$$

where $Y_i' = \ln Y_i$. The following points are worth noting:

1. The residual plot against the fitted values in Figure 10.9a shows no evidence of serious
departures from the model.

2. One of the three candidate models (9.23) subjected to validation studies in Section 9.6
contained $X_5$ (patient age) as a predictor. The regression coefficient for age ($b_5$) was negative
in model (9.23), but when the same model was fit to the validation data, the sign of $b_5$ became
positive. We will now use a residual plot and an added-variable plot to study graphically


FIGURE 10.9 Surgical Unit Example — Regression Model (10.45).

(c) Added-Variable Plot for $X_5$ (horizontal axis: $e(X_5 \mid X_1, X_2, X_3, X_8)$) (d) Normal Probability Plot (horizontal axis: Expected Value)



the strength of the marginal relationship between $X_5$ and the response, when $X_1$, $X_2$, $X_3$,
and $X_8$ are already in the model. Figure 10.9b shows the plot of the residuals for the model
containing $X_1$, $X_2$, $X_3$, and $X_8$ against $X_5$, the predictor variable not in the model. This
plot shows no need to include patient age ($X_5$) in the model to predict logarithm of survival
time. A better view of this marginal relationship is provided by the added-variable plot in
Figure 10.9c. The slope coefficient $b_5$ can be seen again to be slightly negative as depicted
by the solid line in the added-variable plot. Overall, however, the marginal relationship
between $X_5$ and $Y'$ is weak. The P-value of the formal t test (9.18) for dropping $X_5$ from
the model containing $X_1$, $X_2$, $X_3$, $X_5$, and $X_8$ is 0.194. In addition, the plot shows that the
negative slope is driven largely by one or two outliers: one in the upper left region of
the plot, and one in the lower right region. In this way the added-variable plot provides
additional support for dropping $X_5$.

3. The normal probability plot of the residuals in Figure 10.9d shows little departure from
linearity. The coefficient of correlation between the ordered residuals and their expected
values under normality is .982, which is larger than the critical value for significance level .05
in Table B.6.





Multicollinearity was studied by calculating the variance inflation factors: 


Variable     (VIF)_k
X_1           1.10
X_2           1.02
X_3           1.05
X_8           1.09


As may be seen from these results, multicollinearity among the four predictor variables is
not a problem.

Figure 10.10 contains index plots of four key regression diagnostics, namely the studentized
deleted residuals $t_i$ in Figure 10.10a, the leverage values $h_{ii}$ in Figure 10.10b, Cook's
distances $D_i$ in Figure 10.10c, and $(DFFITS)_i$ values in Figure 10.10d. These plots suggest
further study of cases 17, 28, and 38. Table 10.6 lists numerical diagnostic values for
these cases. The measures presented in columns 1-5 are the residuals $e_i$ in (10.8), the
studentized deleted residuals $t_i$ in (10.24), the leverage values $h_{ii}$ in (10.18), the Cook's
distance measures $D_i$ in (10.33), and the $(DFFITS)_i$ values in (10.30). The following are
noteworthy points about the diagnostics in Table 10.6:

1. Case 17 was identified as outlying with regard to its Y value according to its studentized
deleted residual, outlying by more than three standard deviations. We test formally whether
case 17 is outlying by means of the Bonferroni test procedure. For a family significance
level of $\alpha = .05$ and sample size $n = 54$, we require $t(1 - \alpha/2n;\, n - p - 1) = t(.99954;\, 48)
= 3.528$. Since $|t_{17}| = 3.3696 < 3.528$, the formal outlier test indicates that case 17 is not
an outlier. Still, $t_{17}$ is very close to the critical value, and although this case does not appear
to be outlying to any substantial extent, we may wish to investigate the influence of case 17
to remove any doubts.

2. With $2p/n = 2(5)/54 = .185$ as a guide for identifying outlying X observations,
cases 23, 28, 32, 38, 42, and 52 were identified as outlying according to their leverage
values. Incidentally, the univariate dot plots identify only cases 28 and 38 as outlying. Here
we see the value of multivariable outlier identification.

3. To determine the influence of cases 17, 23, 28, 32, 38, 42, and 52, we consider their
Cook's distance and DFFITS values. According to each of these measures, case 17 is the
most influential, with Cook's distance $D_{17} = .3306$ and $(DFFITS)_{17} = 1.4151$. Referring
to the F distribution with 5 and 49 degrees of freedom, we note that the Cook's value
corresponds to the 11th percentile. It thus appears that the influence of case 17 is not large
enough to warrant remedial measures, and consequently the other outlying cases also do
not appear to be overly influential.

A direct check of the influence of case 17 on the inferences of interest was also conducted.
Here, the inferences of primary interest are in the fit of the regression model because the
model is intended to be used for making predictions in the range of the X observations.
Hence, each fitted value $\hat{Y}_i$ based on all 54 observations was compared with the fitted value
$\hat{Y}_{i(17)}$ when case 17 is deleted in fitting the regression model. The average of the absolute
percent differences:

$$\left| \frac{\hat{Y}_{i(17)} - \hat{Y}_i}{\hat{Y}_i} \right| \cdot 100$$



Chapter 10 Building the Regression Model II ： Diagnostics 413 


FIGURE 10.10 Diagnostic Plots for Surgical Unit Example, Regression Model (10.45). Index plots
against case index: (a) Studentized Deleted Residuals; (b) Leverage Values; (c) Cook's Distance;
(d) DFFITS Values.

TABLE 10.6 Various Diagnostics for Outlying Cases, Surgical Unit Example, Regression
Model (10.45).

                 (1)        (2)        (3)        (4)        (5)
Case Number      e_i        t_i        h_ii       D_i        (DFFITS)_i
17               0.5952     3.3696     0.1499     0.3306      1.4151
23               0.2788     1.4854     0.1885     0.1001      0.7160
28               0.0876     0.4896     0.2914     0.0200      0.3140
32              -0.2861    -1.5585     0.2202     0.1333     -0.8283
38              -0.2271    -1.3016     0.3059     0.1472     -0.8641
42              -0.0303    -0.1620     0.2262     0.0016     -0.0876
52              -0.1375    -0.7358     0.2221     0.0312     -0.3931



is only .42 percent, and the largest absolute percent difference (which is for case 17) is only
1.77 percent. Thus, case 17 does not have such a disproportionate influence on the fitted
values that remedial action would be required.

4. In summary, the diagnostic analyses identified a number of potential problems, but
none of these was considered to be serious enough to require further remedial action.
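The case diagnostics discussed in this example can all be computed from a single fit, without refitting the model n times. The sketch below assumes the standard single-fit formulas for the leverage values, studentized deleted residuals, Cook's distances, and DFFITS values; the function name `case_diagnostics` and the simulated data are hypothetical and are not the surgical unit data.

```python
import numpy as np

def case_diagnostics(X, y):
    """Single-fit case diagnostics for a linear regression.

    X is the n x p design matrix including the column of ones;
    y is the response vector of length n.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    # Leverage values h_ii: diagonal of the hat matrix X (X'X)^{-1} X'.
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    e = y - X @ b                      # ordinary residuals
    sse = e @ e
    mse = sse / (n - p)
    # Studentized deleted residuals.
    t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))
    # Cook's distances.
    cooks_d = e**2 * h / (p * mse * (1 - h) ** 2)
    # DFFITS values.
    dffits = t * np.sqrt(h / (1 - h))
    return {"e": e, "t": t, "h": h, "D": cooks_d, "DFFITS": dffits}

# Hypothetical data with one deliberately outlying Y observation (index 4).
rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[4] += 3.0
X = np.column_stack([np.ones(n), x])

diag = case_diagnostics(X, y)
p = X.shape[1]
high_leverage = np.flatnonzero(diag["h"] > 2 * p / n)   # 2p/n rule of thumb
most_influential = int(np.argmax(diag["D"]))
print(high_leverage, most_influential)
```

The studentized deleted residuals returned here would still need to be compared with a Bonferroni critical value, and the Cook's distances with percentiles of the F distribution, as was done for case 17.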


Cited References

10.1. Atkinson, A. C. Plots, Transformations, and Regression. Oxford: Clarendon Press, 1987.

10.2. Mansfield, E. R., and M. D. Conerly. "Diagnostic Value of Residual and Partial Residual Plots,"
The American Statistician 41 (1987), pp. 107-16.

10.3. Cook, R. D. "Exploring Partial Residual Plots," Technometrics 35 (1993), pp. 351-62.

10.4. Rousseeuw, P. J., and A. M. Leroy. Robust Regression and Outlier Detection. New York: John
Wiley & Sons, 1987.

10.5. Belsley, D. A.; E. Kuh; and R. E. Welsch. Regression Diagnostics: Identifying Influential Data
and Sources of Collinearity. New York: John Wiley & Sons, 1980.

10.6. Belsley, D. A. Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York:
John Wiley & Sons, 1991.


Problems 


10.1. A student asked: "Why is it necessary to perform diagnostic checks of the fit when $R^2$ is
large?" Comment.

10.2. A researcher stated: "One good thing about added-variable plots is that they are extremely
useful for identifying model adequacy even when the predictor variables are not properly
specified in the regression model." Comment.

10.3. A student suggested: "If extremely influential outlying cases are detected in a data set, simply
discard these cases from the data set." Comment.

10.4. Describe several informal methods that can be helpful in identifying multicollinearity among
the X variables in a multiple regression model.

10.5. Refer to Brand preference Problem 6.5b. 

a. Prepare an added-variable plot for each of the predictor variables. 

b. Do your plots in part (a) suggest that the regression relationships in the fitted regression
function in Problem 6.5b are inappropriate for any of the predictor variables? Explain.

c. Obtain the fitted regression function in Problem 6.5b by separately regressing both $Y$ and
$X_2$ on $X_1$, and then regressing the residuals in an appropriate fashion.

10.6. Refer to Grocery retailer Problem 6.9.

a. Fit regression model (6.1) to the data using $X_1$ and $X_2$ only.

b. Prepare an added-variable plot for each of the predictor variables $X_1$ and $X_2$.

c. Do your plots in part (b) suggest that the regression relationships in the fitted regression
function in part (a) are inappropriate for any of the predictor variables? Explain.

d. Obtain the fitted regression function in part (a) by separately regressing both $Y$ and $X_2$ on
$X_1$, and then regressing the residuals in an appropriate fashion.

10.7. Refer to Patient satisfaction Problem 6.15c.

a. Prepare an added-variable plot for each of the predictor variables.

b. Do your plots in part (a) suggest that the regression relationships in the fitted regression
function in Problem 6.15c are inappropriate for any of the predictor variables? Explain.





10.8. Refer to Commercial properties Problem 6.18c. 

a. Prepare an added-variable plot for each of the predictor variables. 

b. Do your plots in part (a) suggest that the regression relationships in the fitted regression
function in Problem 6.18c are inappropriate for any of the predictor variables?
Explain.

10.9. Refer to Brand preference Problem 6.5. 

a. Obtain the studentized deleted residuals and identify any outlying Y observations. Use the
Bonferroni outlier test procedure with $\alpha = .10$. State the decision rule and conclusion.

b. Obtain the diagonal elements of the hat matrix, and provide an explanation for the pattern
in these elements.

c. Are any of the observations outlying with regard to their X values according to the rule of
thumb stated in the chapter?

d. Management wishes to estimate the mean degree of brand liking for moisture content
$X_1 = 10$ and sweetness $X_2 = 3$. Construct a scatter plot of $X_2$ against $X_1$ and determine
visually whether this prediction involves an extrapolation beyond the range of the data.
Also, use (10.29) to determine whether an extrapolation is involved. Do your conclusions
from the two methods agree?

e. The largest absolute studentized deleted residual is for case 14. Obtain the DFFITS,
DFBETAS, and Cook's distance values for this case to assess the influence of this case.
What do you conclude?

f. Calculate the average absolute percent difference in the fitted values with and without
case 14. What does this measure indicate about the influence of case 14?

g. Calculate Cook's distance $D_i$ for each case and prepare an index plot. Are any cases
influential according to this measure?

*10.10. Refer to Grocery retailer Problems 6.9 and 6.10. 

a. Obtain the studentized deleted residuals and identify any outlying Y observations. Use the
Bonferroni outlier test procedure with $\alpha = .05$. State the decision rule and conclusion.

b. Obtain the diagonal elements of the hat matrix. Identify any outlying X observations using
the rule of thumb presented in the chapter.

c. Management wishes to predict the total labor hours required to handle the next shipment
containing $X_1 = 300{,}000$ cases whose indirect costs of the total hours is $X_2 = 7.2$ and
$X_3 = 0$ (no holiday in week). Construct a scatter plot of $X_2$ against $X_1$ and determine
visually whether this prediction involves an extrapolation beyond the range of the data.
Also, use (10.29) to determine whether an extrapolation is involved. Do your conclusions
from the two methods agree?

d. Cases 16, 22, 43, and 48 appear to be outlying X observations, and cases 10, 32, 38, and 40
appear to be outlying Y observations. Obtain the DFFITS, DFBETAS, and Cook's distance
values for each of these cases to assess their influence. What do you conclude?

e. Calculate the average absolute percent difference in the fitted values with and without each
of these cases. What does this measure indicate about the influence of each of the cases?

f. Calculate Cook's distance $D_i$ for each case and prepare an index plot. Are any cases
influential according to this measure?

*10.11. Refer to Patient satisfaction Problem 6.15.

a. Obtain the studentized deleted residuals and identify any outlying Y observations. Use the
Bonferroni outlier test procedure with $\alpha = .10$. State the decision rule and conclusion.

b. Obtain the diagonal elements of the hat matrix. Identify any outlying X observations.





c. Hospital management wishes to estimate mean patient satisfaction for patients who are
$X_1 = 30$ years old, whose index of illness severity is $X_2 = 58$, and whose index of anxiety
level is $X_3 = 2.0$. Use (10.29) to determine whether this estimate will involve a hidden
extrapolation.

d. The three largest absolute studentized deleted residuals are for cases 11, 17, and 27. Obtain
the DFFITS, DFBETAS, and Cook's distance values for these cases to assess their influence.
What do you conclude?

e. Calculate the average absolute percent difference in the fitted values with and without each
of these cases. What does this measure indicate about the influence of each of these cases?

f. Calculate Cook's distance $D_i$ for each case and prepare an index plot. Are any cases
influential according to this measure?

10.12. Refer to Commercial Properties Problem 6.18. 

a. Obtain the studentized deleted residuals and identify any outlying Y observations. Use the
Bonferroni outlier test procedure with $\alpha = .01$. State the decision rule and conclusion.

b. Obtain the diagonal elements of the hat matrix. Identify any outlying X observations.

c. The researcher wishes to estimate the rental rates of a property whose age is 10 years,
whose operating expenses and taxes are 12.00, whose occupancy rate is 0.05, and whose
square footage is 350,000. Use (10.29) to determine whether this estimate will involve a
hidden extrapolation.

d. Cases 61, 8, 3, and 53 appear to be outlying X observations, and cases 6 and 62 appear
to be outlying Y observations. Obtain the DFFITS, DFBETAS, and Cook's distance values
for each case to assess its influence. What do you conclude?

e. Calculate the average absolute percent difference in the fitted values with and without each
of the cases. What does this measure indicate about the influence of each case?

f. Calculate Cook's distance $D_i$ for each case and prepare an index plot. Are any cases
influential according to this measure?

10.13. Cosmetics sales. An assistant in the district sales office of a national cosmetics firm obtained
data, shown below, on advertising expenditures and sales last year in the district's 44 territories.
$X_1$ denotes expenditures for point-of-sale displays in beauty salons and department stores (in
thousand dollars), and $X_2$ and $X_3$ represent the corresponding expenditures for local media
advertising and prorated share of national media advertising, respectively. $Y$ denotes sales (in
thousand cases). The assistant was instructed to estimate the increase in expected sales when
$X_1$ is increased by 1 thousand dollars and $X_2$ and $X_3$ are held constant, and was told to use
an ordinary multiple regression model with linear terms for the predictor variables and with
independent normal error terms.

i:        1       2       3      ...    42      43      44
X_i1:     5.6     4.1     3.7    ...    3.6     3.9     5.5
X_i2:     5.6     4.8     3.5    ...    3.7     3.6     5.0
X_i3:     3.8     4.8     3.6    ...    4.4     2.9     5.5
Y_i:     12.85   11.55   12.78   ...   10.47   11.03   12.31

a. State the regression model to be employed and fit it to the data.

b. Test whether there is a regression relation between sales and the three predictor variables;
use $\alpha = .05$. State the alternatives, decision rule, and conclusion.

c. Test for each of the regression coefficients $\beta_k$ ($k = 1, 2, 3$) individually whether or not
$\beta_k = 0$; use $\alpha = .05$ each time. Do the conclusions of these tests correspond to that obtained
in part (b)?









d. Obtain the correlation matrix of the X variables. 

e. What do the results in parts (b), (c), and (d) suggest about the suitability of the data for the 
research objective? 

10.14. Refer to Cosmetics sales Problem 10.13. 

a. Obtain the three variance inflation factors. What do these suggest about the effects of 
multicollinearity here? 

b. The assistant eventually decided to drop variables $X_2$ and $X_3$ from the model "to clear up
the picture." Fit the assistant's revised model. Is the assistant now in a better position to
achieve the research objective?

c. Why would an experiment here be more effective in providing suitable data to meet the
research objective? How would you design such an experiment? What regression model
would you employ?

10.15. Refer to Brand preference Problem 6.5a. 

a. What do the scatter plot matrix and the correlation matrix show about pairwise linear
associations among the predictor variables?

b. Find the two variance inflation factors. Why are they both equal to 1?

*10.16. Refer to Grocery retailer Problem 6.9c. 

a. What do the scatter plot matrix and the correlation matrix show about pairwise linear
associations among the predictor variables?

b. Find the three variance inflation factors. Do they indicate that a serious multicollinearity
problem exists here?

*10.17. Refer to Patient satisfaction Problem 6.15b. 

a. What do the scatter plot matrix and the correlation matrix show about pairwise linear
associations among the predictor variables?

b. Obtain the three variance inflation factors. What do these results suggest about the effects 
of multicollinearity here? Are these results more revealing than those in part (a)? 

10.18. Refer to Commercial properties Problem 6.18b. 

a. What do the scatter plot matrix and the correlation matrix show about pairwise linear 
associations among the predictor variables? 

b. Obtain the four variance inflation factors. Do they indicate that a serious multicollinearity 
problem exists here? 

10.19. Refer to Job proficiency Problems 9.10 and 9.11. The subset model containing only first-order
terms in $X_1$ and $X_3$ is to be evaluated in detail.

a. Obtain the residuals and plot them separately against $\hat{Y}$, each of the four predictor variables,
and the cross-product term $X_1 X_3$. On the basis of these plots, should any modifications in
the regression model be investigated?

b. Prepare separate added-variable plots against $e(X_1|X_3)$ and $e(X_3|X_1)$. Do these plots
suggest that any modifications in the model form are warranted?

c. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Test the
reasonableness of the normality assumptions, using Table B.6 and $\alpha = .01$. What do you
conclude?

d. Obtain the studentized deleted residuals and identify any outlying Y observations.
Use the Bonferroni outlier test procedure with $\alpha = .05$. State the decision rule and
conclusion.



e. Obtain the diagonal elements of the hat matrix. Using the rule of thumb in the text, identify
any outlying X observations. Are your findings consistent with those in Problem 9.10a?
Should they be? Comment.

f. Cases 7 and 18 appear to be moderately outlying with respect to their X values, and
case 16 is reasonably far outlying with respect to its Y value. Obtain DFFITS, DFBETAS,
and Cook's distance values for these cases to assess their influence. What do you conclude?

g. Obtain the variance inflation factors. What do they indicate?

10.20. Refer to Lung pressure Problems 9.13 and 9.14. The subset regression model containing
first-order terms for $X_1$ and $X_2$ and the cross-product term $X_1 X_2$ is to be evaluated in
detail.


a. Obtain the residuals and plot them separately against $\hat{Y}$ and each of the three predictor
variables. On the basis of these plots, should any further modifications of the regression
model be attempted?

b. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Does the normality
assumption appear to be reasonable here?

c. Obtain the variance inflation factors. Are there any indications that serious multicollinearity
problems are present? Explain.

d. Obtain the studentized deleted residuals and identify any outlying Y observations. Use the
Bonferroni outlier test procedure with $\alpha = .05$. State the decision rule and conclusion.

e. Obtain the diagonal elements of the hat matrix. Using the rule of thumb in the text, identify
any outlying X observations. Are your findings consistent with those in Problem 9.13a?
Should they be? Discuss.

f. Cases 3, 8, and 15 are moderately far outlying with respect to their X values, and case 7 is
relatively far outlying with respect to its Y value. Obtain DFFITS, DFBETAS, and Cook's
distance values for these cases to assess their influence. What do you conclude?


*10.21. Refer to Kidney function Problem 9.15 and the regression model fitted in part (c). 


a. Obtain the variance inflation factors. Are there indications that serious multicollinearity
problems exist here? Explain.

b. Obtain the residuals and plot them separately against $\hat{Y}$ and each of the predictor variables.
Also prepare a normal probability plot of the residuals.

c. Prepare separate added-variable plots against $e(X_1|X_2, X_3)$, $e(X_2|X_1, X_3)$, and
$e(X_3|X_1, X_2)$.

d. Do the plots in parts (b) and (c) suggest that the regression model should be modified? 


*10.22. Refer to Kidney function Problems 9.15 and 10.21. Theoretical arguments suggest use of the
following regression function:

$$E\{\ln Y\} = \beta_0 + \beta_1 \ln X_1 + \beta_2 \ln(140 - X_2) + \beta_3 \ln X_3$$

a. Fit the regression function based on theoretical considerations.

b. Obtain the residuals and plot them separately against $\hat{Y}$ and each predictor variable in the
fitted model. Also prepare a normal probability plot of the residuals. Have the difficulties
noted in Problem 10.21 now largely been eliminated?

c. Obtain the variance inflation factors. Are there indications that serious multicollinearity
problems exist here? Explain.

d. Obtain the studentized deleted residuals and identify any outlying Y observations. Use the
Bonferroni outlier test procedure with $\alpha = .10$. State the decision rule and conclusion.





e. Obtain the diagonal elements of the hat matrix. Using the rule of thumb in the text, identify 
any outlying X observations. 

f. Cases 28 and 29 are relatively far outlying with respect to their Y values. Obtain DFFITS, 
DFBETAS, and Cook’s distance values for these cases to assess their influence. What do 
you conclude? 


Exercises 


10.23. Show that (10.37) is algebraically equivalent to (10.33a).

10.24. If $n = p$ and the $\mathbf{X}$ matrix is invertible, use (5.34) and (5.37) to show that the hat matrix $\mathbf{H}$ is
given by the $p \times p$ identity matrix. In this case, what are $h_{ii}$ and $\hat{Y}_i$?

10.25. Show that (10.26) follows from (10.24a) and (10.25).

10.26. Prove (9.11), using (10.27) and Exercise 5.31.


Projects 



10.27. Refer to the SENIC dataset in Appendix C. 1 and Project 9.25. The regression model containing 
age, routine chest X-ray ratio, and average daily census in first-order terms is to be evaluated 
in detail based on the model-building data set. 

a. Obtain the residuals and plot them separately against $\hat{Y}$, each of the predictor variables in
the model, and each of the related cross-product terms. On the basis of these plots, should
any modifications of the model be made?

b. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Test the
reasonableness of the normality assumption, using Table B.6 and $\alpha = .05$. What do you
conclude?

c. Obtain the scatter plot matrix, the correlation matrix of the X variables, and the variance 
inflation factors. Are there any indications that serious multicollinearity problems are 
present? Explain. 

d. Obtain the studentized deleted residuals and prepare a dot plot of these residuals. Are any 
outliers present? Use the Bonferroni outlier test procedure with a = .01. State the decision 
rule and conclusion. 

e. Obtain the diagonal elements of the hat matrix. Using the rule of thumb in the text, identify
any outlying X observations.

f. Cases 62, 75, 106, and 112 are moderately outlying with respect to their X values, and 
case 87 is reasonably far outlying with respect to its Y value. Obtain DFFITS, DFBETAS, 
and Cook’s distance values for these cases to assess their influence. What do you 
conclude? 

10.28. Refer to the CDI data set in Appendix C.2 and Project 9.26. The regression model containing
variables 6, 8, 9, 13, 14, and 15 in first-order terms is to be evaluated in detail based on the
model-building data set.

a. Obtain the residuals and plot them separately against $\hat{Y}$, each predictor variable in the model,
and the related cross-product term. On the basis of these plots, should any modifications
in the model be made?

b. Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation
between the ordered residuals and their expected values under normality. Test the
reasonableness of the normality assumption, using Table B.6 and $\alpha = .01$. What do you
conclude?



c. Obtain the scatter plot matrix, the correlation matrix of the X variables, and the variance
inflation factors. Are there any indications that serious multicollinearity problems are
present? Explain.

d. Obtain the studentized deleted residuals and prepare a dot plot of these residuals. Are any
outliers present? Use the Bonferroni outlier test procedure with $\alpha = .05$. State the decision
rule and conclusion.

e. Obtain the diagonal elements of the hat matrix. Using the rule of thumb in the text, identify
any outlying X observations.

f. Cases 2, 8, 48, 128, 206, and 404 are outlying with respect to their X values, and cases 2
and 6 are reasonably far outlying with respect to their Y values. Obtain DFFITS, DFBETAS,
and Cook's distance values for these cases to assess their influence. What do you conclude?


Case Studies


10.29. Refer to the Website developer data set in Appendix C.6 and Case Study 9.29. For the best
subset model developed in Case Study 9.29, perform appropriate diagnostic checks to evaluate
outliers and assess their influence. Do any serious multicollinearity problems exist here?

10.30. Refer to the Prostate cancer data set in Appendix C.5 and Case Study 9.30. For the best 
subset model developed in Case Study 9.30, perform appropriate diagnostic checks to evaluate 
outliers and assess their influence. Do any serious multicollinearity problems exist here? 

10.31. Refer to the Real estate data set in Appendix C.7 and Case Study 9.31. For the best subset
model developed in Case Study 9.31, perform appropriate diagnostic checks to evaluate
outliers and assess their influence. Do any serious multicollinearity problems exist here?


Chapter 11


Building the Regression
Model III: Remedial
Measures

When the diagnostics indicate that a regression model is not appropriate or that one or several
cases are very influential, remedial measures may need to be taken. In earlier chapters,
we discussed some remedial measures, such as transformations to linearize the regression
relation, to make the error distributions more nearly normal, or to make the variances of
the error terms more nearly equal. In this chapter, we take up some additional remedial
measures to deal with unequal error variances, a high degree of multicollinearity, and influential
observations. We next consider two methods for nonparametric regression in detail,
lowess and regression trees. Since these remedial measures and alternative approaches often
involve relatively complex estimation procedures, we consider next a general approach,
called bootstrapping, for evaluating the precision of these complex estimators. We conclude
the chapter by presenting a case that illustrates some of the issues that arise in model
building.


11.1 Unequal Error Variances Remedial Measures: Weighted Least Squares

We explained in Chapters 3 and 6 how transformations of Y may be helpful in reducing or
eliminating unequal variances of the error terms. A difficulty with transformations of Y is that
they may create an inappropriate regression relationship. When an appropriate regression
relationship has been found but the variances of the error terms are unequal, an alternative
to transformations is weighted least squares, a procedure based on a generalization of
multiple regression model (6.7). We shall now denote the variance of the error term $\varepsilon_i$ by
$\sigma_i^2$ to recognize that different error terms may have different variances. The generalized
multiple regression model can then be expressed as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i \qquad (11.1)$$

where:

$\beta_0, \beta_1, \ldots, \beta_{p-1}$ are parameters
$X_{i1}, \ldots, X_{i,p-1}$ are known constants
$\varepsilon_i$ are independent $N(0, \sigma_i^2)$

The variance-covariance matrix of the error terms for the generalized multiple regression
model (11.1) is more complex than before:

$$\underset{n \times n}{\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\}} =
\begin{bmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \sigma_n^2
\end{bmatrix} \qquad (11.2)$$

The estimation of the regression coefficients in generalized model (11.1) could be done by
using the estimators in (6.25) for regression model (6.7) with equal error variances. These
estimators are still unbiased and consistent for generalized regression model (11.1), but they
no longer have minimum variance. To obtain unbiased estimators with minimum variance,
we must take into account that the different Y observations for the n cases no longer have
the same reliability. Observations with small variances provide more reliable information
about the regression function than those with large variances. We shall first consider the
estimation of the regression coefficients when the error variances $\sigma_i^2$ are known. This case
is usually unrealistic, but it provides guidance as to how to proceed when the error variances
are not known.

Error Variances Known

When the error variances $\sigma_i^2$ are known, we can use the method of maximum likelihood to
obtain estimators of the regression coefficients in generalized regression model (11.1). The
likelihood function in (6.26) for the case of equal error variances $\sigma^2$ is modified by replacing
the $\sigma^2$ terms with the respective variances $\sigma_i^2$ and expressing the likelihood function in the
first form of (1.26):

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma_i^2)^{1/2}}
\exp\left[ -\frac{1}{2\sigma_i^2} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 \right] \qquad (11.3)$$

where $\boldsymbol{\beta}$ as usual denotes the vector of the regression coefficients. We define the reciprocal
of the variance $\sigma_i^2$ as the weight $w_i$:

$$w_i = \frac{1}{\sigma_i^2} \qquad (11.4)$$

We can then express the likelihood function (11.3) as follows, after making some
simplifications:

$$L(\boldsymbol{\beta}) = \left[ \prod_{i=1}^{n} \left( \frac{w_i}{2\pi} \right)^{1/2} \right]
\exp\left[ -\frac{1}{2} \sum_{i=1}^{n} w_i (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 \right] \qquad (11.5)$$



We find the maximum likelihood estimators of the regression coefficients by maximizing
$L(\boldsymbol{\beta})$ in (11.5) with respect to $\beta_0, \beta_1, \ldots, \beta_{p-1}$. Since the error variances $\sigma_i^2$ and hence
the weights $w_i$ are assumed to be known, maximizing $L(\boldsymbol{\beta})$ with respect to the regression
coefficients is equivalent to minimizing the exponential term:

$$Q_w = \sum_{i=1}^{n} w_i (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2 \qquad (11.6)$$

This term to be minimized for obtaining the maximum likelihood estimators is also the
weighted least squares criterion, denoted by $Q_w$. Thus, the methods of maximum likelihood
and weighted least squares lead to the same estimators for the generalized multiple
regression model (11.1), as is also the case for the ordinary multiple regression model (6.7).

Note how the weighted least squares criterion (11.6) generalizes the ordinary least squares
criterion in (6.22) by replacing equal weights of 1 by $w_i$. Since the weight $w_i$ is inversely
related to the variance $\sigma_i^2$, it reflects the amount of information contained in the observation
$Y_i$. Thus, an observation $Y_i$ that has a large variance receives less weight than another
observation that has a smaller variance. Intuitively, this is reasonable. The more precise is
$Y_i$ (i.e., the smaller is $\sigma_i^2$), the more information $Y_i$ provides about $E\{Y_i\}$ and therefore the
more weight it should receive in fitting the regression function.

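It is easiest to express the maximum likelihood and weighted least squares estimators of
the regression coefficients for model (11.1) in matrix terms. Let the matrix $\mathbf{W}$ be a diagonal
matrix containing the weights $w_i$:

$$\underset{n \times n}{\mathbf{W}} =
\begin{bmatrix}
w_1 & 0 & \cdots & 0 \\
0 & w_2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & w_n
\end{bmatrix} \qquad (11.7)$$

The normal equations can then be expressed as follows:

$$(\mathbf{X}'\mathbf{W}\mathbf{X})\mathbf{b}_w = \mathbf{X}'\mathbf{W}\mathbf{Y} \qquad (11.8)$$

and the weighted least squares and maximum likelihood estimators of the regression coefficients
are:

$$\underset{p \times 1}{\mathbf{b}_w} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y} \qquad (11.9)$$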

where $\mathbf{b}_w$ is the vector of the estimated regression coefficients obtained by weighted least
squares. The variance-covariance matrix of the weighted least squares estimated regression
coefficients is:

$$\underset{p \times p}{\boldsymbol{\sigma}^2\{\mathbf{b}_w\}} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \qquad (11.10)$$

Note that this variance-covariance matrix is known since the variances $\sigma_i^2$ are assumed to
be known.

The weighted least squares and maximum likelihood estimators of the regression coefficients
in (11.9) are unbiased, consistent, and have minimum variance among unbiased
linear estimators. Thus, when the weights are known, $\mathbf{b}_w$ generally exhibits less variability
than the ordinary least squares estimator $\mathbf{b}$.
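The matrix results (11.9) and (11.10) translate directly into a few lines of linear algebra. The following sketch assumes the variances are known exactly; the function name `weighted_least_squares` and the simulated data, in which the error standard deviation is taken to be proportional to the predictor, are assumptions made only for illustration.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Weighted least squares with known weights w_i = 1 / sigma_i^2.

    Returns b_w from (11.9) and the variance-covariance matrix
    (X'WX)^{-1} from (11.10).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    xtw = X.T * w                    # X'W without forming the n x n matrix W
    xtwx = xtw @ X
    b_w = np.linalg.solve(xtwx, xtw @ y)   # solves the normal equations (11.8)
    cov_b_w = np.linalg.inv(xtwx)
    return b_w, cov_b_w

# Hypothetical data: error standard deviation grows with x.
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=40)
sigma = 0.5 * x
y = 2.0 + 3.0 * x + rng.normal(scale=sigma)
X = np.column_stack([np.ones_like(x), x])
w = 1.0 / sigma**2

b_w, cov_b_w = weighted_least_squares(X, y, w)
print(b_w)
print(np.sqrt(np.diag(cov_b_w)))     # standard deviations of b_w0 and b_w1
```

Setting every weight to 1 reduces the calculation to ordinary least squares, which is a convenient check on an implementation of this kind.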





Many computer regression packages will provide the weighted least squares estimated
regression coefficients. The user simply needs to provide the weights $w_i$.


Error Variances Known up to Proportionality Constant 

We now relax the requirement that the variances $\sigma_i^2$ are known by considering the case
where only the relative magnitudes of the variances are known. For instance, if we know
that $\sigma_2^2$ is twice as large as $\sigma_1^2$, we might use the weights $w_1 = 1$,
$w_2 = 1/2$. In that case the relative weights $w_i$ are a constant multiple of the unknown true
weights $1/\sigma_i^2$:

$$w_i = k\left(\frac{1}{\sigma_i^2}\right) \tag{11.11}$$

where $k$ is the proportionality constant. It can be shown that the weighted least squares and
maximum likelihood estimators are unaffected by the unknown proportionality constant $k$
and are still given by (11.9). The reason is that the proportionality constant $k$ appears on
both sides of the normal equations (11.8) and cancels out. The variance-covariance matrix
of the weighted least squares regression coefficients is now as follows:

$$\underset{p \times p}{\boldsymbol{\sigma}^2\{\mathbf{b}_w\}} = k(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \tag{11.12}$$

This matrix is unknown because the proportionality constant $k$ is not known. It can be
estimated, however. The estimated variance-covariance matrix of the regression coefficients
$\mathbf{b}_w$ is:

$$\underset{p \times p}{\mathbf{s}^2\{\mathbf{b}_w\}} = MSE_w(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \tag{11.13}$$

where $MSE_w$ is based on the weighted squared residuals:

$$MSE_w = \frac{\sum w_i (Y_i - \hat{Y}_i)^2}{n - p} = \frac{\sum w_i e_i^2}{n - p} \tag{11.13a}$$

Thus, $MSE_w$ here is an estimator of the proportionality constant $k$.
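
The invariance to the proportionality constant can be checked numerically. The sketch below, which assumes simulated data and a hypothetical helper `wls_with_mse`, fits the same model with weights $w_i$ and with $5w_i$: the coefficients from (11.9) and the estimated covariance matrix from (11.13) agree, while $MSE_w$ from (11.13a) is rescaled by the same factor.

```python
import numpy as np

def wls_with_mse(X, Y, w):
    """Return b_w from (11.9), MSE_w from (11.13a), and s^2{b_w} from (11.13)."""
    XtW = X.T * w
    XtWX_inv = np.linalg.inv(XtW @ X)
    b_w = XtWX_inv @ (XtW @ Y)
    resid = Y - X @ b_w
    n, p = X.shape
    mse_w = np.sum(w * resid**2) / (n - p)
    return b_w, mse_w, mse_w * XtWX_inv

# Hypothetical data: error variance proportional to x^2, constant unknown.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 30)
X = np.column_stack([np.ones_like(x), x])
Y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3 * x)

w = 1.0 / x**2                                   # relative weights only
b1, mse1, cov1 = wls_with_mse(X, Y, w)
b2, mse2, cov2 = wls_with_mse(X, Y, 5.0 * w)     # same weights times an arbitrary constant
print(np.allclose(b1, b2), np.allclose(cov1, cov2), mse2 / mse1)
```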


Error Variances Unknown 

If the variances $\sigma_i^2$ were known, or even known up to a proportionality constant, the use
of weighted least squares with weights $w_i$ would be straightforward. Unfortunately, one rarely
has knowledge of the variances $\sigma_i^2$. We are then forced to use estimates of the variances.
These can be obtained in a variety of ways. We discuss two methods of obtaining estimates
of the variances $\sigma_i^2$.

Estimation of Variance Function or Standard Deviation Function. The first method
of obtaining estimates of the error term variances $\sigma_i^2$ is based on empirical findings
that the magnitudes of $\sigma_i^2$ and $\sigma_i$ often vary in a regular fashion with one or
several predictor variables $X_k$ or with the mean response $E\{Y_i\}$. Figure 3.4c, for example,
shows a typical "megaphone" prototype residual plot where $\sigma_i^2$ increases as the predictor
variable $X$ becomes larger. Such a relationship between $\sigma_i^2$ and one or several predictor
variables can be estimated because the squared residual $e_i^2$ obtained from an ordinary least
squares regression fit is an estimate of $\sigma_i^2$, provided that the regression function is
appropriate. We know from (A.15a) that the variance of the error term $\varepsilon_i$, denoted by
$\sigma_i^2$, can be expressed as follows:

$$\sigma_i^2 = E\{\varepsilon_i^2\} - (E\{\varepsilon_i\})^2 \tag{11.14}$$

Since $E\{\varepsilon_i\} = 0$ according to the regression model, we obtain:

$$\sigma_i^2 = E\{\varepsilon_i^2\} \tag{11.15}$$

Hence, the squared residual $e_i^2$ is an estimator of $\sigma_i^2$. Furthermore, the absolute
residual $|e_i|$ is an estimator of the standard deviation $\sigma_i$, since
$\sigma_i = |\sqrt{\sigma_i^2}|$.

We can therefore estimate the variance function describing the relation of $\sigma_i^2$ to
relevant predictor variables by first fitting the regression model using unweighted least squares
and then regressing the squared residuals $e_i^2$ against the appropriate predictor variables.
Alternatively, we can estimate the standard deviation function describing the relation of
$\sigma_i$ to relevant predictor variables by regressing the absolute residuals $|e_i|$ obtained
from fitting the regression model using unweighted least squares against the appropriate predictor
variables. If there are any outliers in the data, it is generally advisable to estimate the standard
deviation function rather than the variance function, because regressing absolute residuals
is less affected by outliers than regressing squared residuals. Reference 11.1 provides a
detailed discussion of the issues encountered in estimating variance and standard deviation
functions.

We illustrate the use of some possible variance and standard deviation functions: 

1. A residual plot against $X_1$ exhibits a megaphone shape. Regress the absolute residuals
against $X_1$.

2. A residual plot against $\hat{Y}$ exhibits a megaphone shape. Regress the absolute residuals
against $\hat{Y}$.

3. A plot of the squared residuals against $X_3$ exhibits an upward tendency. Regress the
squared residuals against $X_3$.

4. A plot of the residuals against $X_2$ suggests that the variance increases rapidly with
increases in $X_2$ up to a point and then increases more slowly. Regress the absolute
residuals against $X_2$ and $X_2^2$.


After the variance function or the standard deviation function is estimated, the fitted
values from this function are used to obtain the estimated weights:

$$w_i = \frac{1}{(\hat{s}_i)^2} \quad \text{where } \hat{s}_i \text{ is fitted value from standard deviation function} \tag{11.16a}$$

$$w_i = \frac{1}{\hat{v}_i} \quad \text{where } \hat{v}_i \text{ is fitted value from variance function} \tag{11.16b}$$

The estimated weights are then placed in the weight matrix $\mathbf{W}$ in (11.7) and the estimated
regression coefficients are obtained by (11.9), as follows:

$$\mathbf{b}_w = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y} \tag{11.17}$$

The weighted error mean square $MSE_w$ may be viewed here as an estimator of the
proportionality constant $k$ in (11.11). If the modeling of the variance or standard deviation
function is done well, the proportionality constant will be near 1 and $MSE_w$ should then be near 1.





We summarize the estimation process: 

1. Fit the regression model by unweighted least squares and analyze the residuals.

2. Estimate the variance function or the standard deviation function by regressing either
the squared residuals or the absolute residuals on the appropriate predictor(s).

3. Use the fitted values from the estimated variance or standard deviation function to obtain
the weights $w_i$.

4. Estimate the regression coefficients using these weights.

If the estimated coefficients differ substantially from the estimated regression coefficients
obtained by ordinary least squares, it is usually advisable to iterate the weighted least squares
process by using the residuals from the weighted least squares fit to reestimate the variance
or standard deviation function and then obtain revised weights. Often one or two iterations
are sufficient to stabilize the estimated regression coefficients. This iteration process is often
called iteratively reweighted least squares.
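
The four steps and the optional iteration can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the helper names, the simulated data, the choice of a linear standard deviation function in the same predictors, and the positivity guard on the fitted standard deviations are all hypothetical additions.

```python
import numpy as np

def ols(X, Y):
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def wls(X, Y, w):
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ Y)

def estimated_weights(X, abs_resid):
    """Steps 2-3: regress |e_i| on the predictors, then w_i = 1 / s_hat_i^2 (11.16a)."""
    s_hat = X @ ols(X, abs_resid)
    # Guard (an assumption, not in the text): keep fitted standard deviations positive.
    s_hat = np.maximum(s_hat, 0.05 * abs_resid.mean())
    return 1.0 / s_hat**2

def reweighted_fit(X, Y, n_iter=2):
    b = ols(X, Y)                                      # step 1
    for _ in range(n_iter):
        w = estimated_weights(X, np.abs(Y - X @ b))    # steps 2-3
        b = wls(X, Y, w)                               # step 4
    return b, w

# Hypothetical data with error standard deviation increasing in x.
rng = np.random.default_rng(3)
x = rng.uniform(20.0, 60.0, 54)
X = np.column_stack([np.ones_like(x), x])
Y = 56.0 + 0.58 * x + rng.normal(0.0, 0.2 * x - 1.5)

b_irls, w_irls = reweighted_fit(X, Y)
print(b_irls)
```

With `n_iter=1` the function performs the basic four-step procedure; larger values reestimate the standard deviation function from the weighted fit's residuals.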

Use of Replicates or Near Replicates. A second method of obtaining estimates of the
error term variances $\sigma_i^2$ can be utilized in designed experiments where replicate
observations are made at each combination of levels of the predictor variables. If the number of
replications is large, the weights $w_i$ may be obtained directly from the sample variances of
the $Y$ observations at each combination of levels of the $X$ variables. Otherwise, the sample
variances or sample standard deviations should first be regressed against appropriate
predictor variables to estimate the variance or standard deviation function, from which the
weights can then be obtained. Note that each case in a replicate group receives the same
weight with this method.

In observational studies, replicate observations often are not present. Near replicates may
then be used. For example, if the residual plot against $X_1$ shows a megaphone appearance,
cases with similar $X_1$ values can be grouped together and the variance of the residuals in
each group calculated. The reciprocals of these variances are then used as the weights $w_i$
if the number of replications is large. Otherwise, a variance or standard deviation function
may be estimated to obtain the weights. Again, all cases in a near-replicate group receive
the same weight. If the estimated regression coefficients differ substantially from those
obtained with ordinary least squares, the procedure may be iterated, as when an estimated
variance or standard deviation function is used.

Inference Procedures when Weights Are Estimated. When the error variances $\sigma_i^2$ are
unknown so that the weights $w_i$ need to be estimated, which almost always is the case,
the variance-covariance matrix of the estimated regression coefficients is usually estimated
by means of (11.13), using the estimated weights, provided the sample size is not very
small. Confidence intervals for regression coefficients are then obtained by means of (6.50),
with the estimated standard deviation $s\{b_{wk}\}$ obtained from the matrix (11.13). Confidence
intervals for mean responses are obtained by means of (6.59), using $\mathbf{s}^2\{\mathbf{b}_w\}$
from (11.13) in (6.58). These inference procedures are now only approximate, however, because the
estimation of the variances $\sigma_i^2$ introduces another source of variability. The
approximation is often quite good when the sample size is not too small. One means of determining
whether the approximation is good is to use bootstrapping, a statistical procedure that will
be explained in Section 11.5.





TABLE 11.1 Weighted Least Squares: Blood Pressure Example.

  (1)      (2)              (3)       (4)       (5)            (6)
  Age      Diastolic
           Blood Pressure
  $X_i$    $Y_i$            $e_i$     $|e_i|$   $\hat{s}_i$    $w_i$

  27        73               1.18      1.18      3.801         .06921
  21        66              -2.34      2.34      2.612         .14656
  22        63              -5.92      5.92      2.810         .12662
  ...      ...               ...       ...       ...           ...
  52       100              13.68     13.68      8.756         .01304
  58        80              -9.80      9.80      9.944         .01011
  57       109              19.78     19.78      9.746         .01053

Use of Ordinary Least Squares with Unequal Error Variances. If one uses $\mathbf{b}$ (not
$\mathbf{b}_w$) with unequal error variances, the ordinary least squares estimators of the
regression coefficients are still unbiased and consistent, but they are no longer minimum variance
estimators. Also, $\boldsymbol{\sigma}^2\{\mathbf{b}\}$ is no longer given by
$\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. The correct variance-covariance matrix is:

$$\boldsymbol{\sigma}^2\{\mathbf{b}\} = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\}\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}$$

If error variances are unequal and unknown, an appropriate estimator of
$\boldsymbol{\sigma}^2\{\mathbf{b}\}$ can still be obtained using ordinary least squares. The White
estimator (Ref. 11.2) is:

$$\mathbf{S}^2\{\mathbf{b}\} = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{S}_0\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}$$

where:

$$\underset{n \times n}{\mathbf{S}_0} =
\begin{bmatrix}
e_1^2 & 0 & \cdots & 0 \\
0 & e_2^2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & e_n^2
\end{bmatrix}$$

and where $e_1, \ldots, e_n$ are the ordinary least squares residuals. White's
estimator is sometimes referred to as a robust covariance matrix, because it can be used
to make appropriate inferences about the regression parameters based on ordinary least
squares, without having to specify the form of the nonconstant error variance.
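
The White estimator is short enough to compute by hand. The following is a minimal sketch, assuming simulated heteroscedastic data and a hypothetical function name `white_covariance`; it implements only the basic form shown above, without any small-sample adjustment.

```python
import numpy as np

def white_covariance(X, Y):
    """OLS coefficients with White's estimator (X'X)^{-1} (X' S_0 X) (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)
    e = Y - X @ b                         # ordinary least squares residuals
    middle = (X.T * e**2) @ X             # X' S_0 X with S_0 = diag(e_i^2)
    return b, XtX_inv @ middle @ XtX_inv

# Hypothetical data with error variance increasing in x.
rng = np.random.default_rng(5)
x = rng.uniform(20.0, 60.0, 54)
X = np.column_stack([np.ones_like(x), x])
Y = 56.0 + 0.58 * x + rng.normal(0.0, 0.2 * x)

b, S2_b = white_covariance(X, Y)
robust_se = np.sqrt(np.diag(S2_b))        # robust standard errors of b_0 and b_1
print(b, robust_se)
```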

Example

A health researcher, interested in studying the relationship between diastolic blood pressure
and age among healthy adult women 20 to 60 years old, collected data on 54 subjects. A
portion of the data is presented in Table 11.1, columns 1 and 2. The scatter plot of the data
in Figure 11.1a strongly suggests a linear relationship between diastolic blood pressure and
age but also indicates that the error term variance increases with age. The researcher fitted a
linear regression function by unweighted least squares to conduct some preliminary analyses
of the residuals. The fitted regression function and the estimated standard deviations of $b_0$
and $b_1$ are:

$$\hat{Y} = \underset{(3.994)}{56.157} + \underset{(.09695)}{.58003}X \tag{11.18}$$


The residuals are shown in Table 11.1, column 3, and the absolute residuals are presented in
column 4. Figure 11.1a presents this estimated regression function. Figure 11.1b presents
a plot of the residuals against $X$, which confirms the nonconstant error variance. A plot
of the absolute residuals against $X$ in Figure 11.1c suggests that a linear relation between
the error standard deviation and $X$ may be reasonable. The analyst therefore regressed the
absolute residuals against $X$ and obtained:

$$\hat{s} = -1.54946 + .198172X \tag{11.19}$$

Here, $\hat{s}$ denotes the estimated expected standard deviation. The estimated standard deviation
function in (11.19) is shown in Figure 11.1c.

To obtain the weights $w_i$, the analyst obtained the fitted values from the standard deviation
function in (11.19). For example, for case 1, for which $X_1 = 27$, the fitted value is:

$$\hat{s}_1 = -1.54946 + .198172(27) = 3.801$$

The fitted values are shown in Table 11.1, column 5. The weights are then obtained by
using (11.16a). For case 1, we obtain:

$$w_1 = \frac{1}{(\hat{s}_1)^2} = \frac{1}{(3.801)^2} = .0692$$

The weights $w_i$ are shown in Table 11.1, column 6.

Using these weights in a regression program that has weighted least squares capability,
the analyst obtained the following estimated regression function:

$$\hat{Y} = 55.566 + .59634X \tag{11.20}$$


FIGURE 11.1 Diagnostic Plots Detecting Unequal Error Variances: Blood Pressure Example.
(a) Scatter Plot (blood pressure against age)
(b) Residual Plot against X (residual against age)
(c) Absolute Residual Plot against X (absolute residual against age)


Note that the estimated regression coefficients are not much different from those in (11.18) 
obtained with unweighted least squares. Since the regression coefficients changed only a 
little, the analyst concluded that there was no need to reestimate the standard deviation 
function and the weights based on the residuals for the weighted regression in (11.20). 

The analyst next obtained the estimated variance-covariance matrix of the estimated
regression coefficients by means of (11.13) to find the approximate estimated standard
deviation $s\{b_{w1}\} = .07924$. It is interesting to note that this standard deviation is somewhat
smaller than the standard deviation of the estimate obtained by ordinary least squares
in (11.18), .09695. The reduction of about 18 percent is the result of the recognition of
unequal error variances when using weighted least squares.

To obtain an approximate 95 percent confidence interval for $\beta_1$, the analyst employed
(6.50) and required $t(.975; 52) = 2.007$. The confidence limits then are .59634 ± 2.007
(.07924) and the approximate 95 percent confidence interval is:

$$.437 \le \beta_1 \le .755$$

We shall consider the appropriateness of this inference approximation in Section 11.5.
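
The arithmetic of the example can be retraced with a few lines of Python. Only numbers quoted in the example are used; the variable names are illustrative. Note that the intercept of the standard deviation function must be negative for the fitted value at age 27 to equal 3.801.

```python
# Case 1 of the blood pressure example: age 27
age_case1 = 27
s_hat_1 = -1.54946 + 0.198172 * age_case1    # fitted standard deviation, (11.19)
w_1 = 1.0 / s_hat_1**2                       # estimated weight, (11.16a)

# Approximate 95 percent confidence interval for beta_1, via (6.50)
b_w1, se_b_w1, t_crit = 0.59634, 0.07924, 2.007
lower = b_w1 - t_crit * se_b_w1
upper = b_w1 + t_crit * se_b_w1

# Relative reduction in the standard deviation of the slope versus OLS (.09695)
reduction = 1.0 - se_b_w1 / 0.09695

print(round(s_hat_1, 3), round(w_1, 4))      # 3.801 0.0692
print(round(lower, 3), round(upper, 3))      # 0.437 0.755
print(round(reduction, 2))                   # 0.18
```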


Comments 

1. The condition of the error variance not being constant over all cases is called heteroscedasticity, 
in contrast to the condition of equal error variances, called homoscedasticity. 

2. Heteroscedasticity is inherent when the response in regression analysis follows a distribution in
which the variance is functionally related to the mean. (Significant nonnormality in Y is encountered
as well in most such cases.) Consider, in this connection, a regression analysis where X is the speed
of a machine which puts a plastic coating on cable and Y is the number of blemishes in the coating
per thousand feet of cable. If Y is Poisson distributed with a mean which increases as X increases,
the distributions of Y cannot have constant variance at all levels of X since the variance of a Poisson
variable equals the mean, which is increasing with X.

3. Estimation of the weights by means of an estimated variance or standard deviation function or
by means of groups of replicates or near replicates can be very helpful when there are major differences
in the variances of the error terms. When the differences are only small or modest, however, weighted
least squares with these approximate methods will not be particularly helpful.

4. The weighted least squares output of some multiple regression software packages includes $R^2$,
the coefficient of multiple determination. Users of these packages need to treat this measure with
caution, because $R^2$ does not have a clear-cut meaning for weighted least squares.

5. The weighted least squares estimators of the regression coefficients in (11.9) for the case of
known error variances $\sigma_i^2$ can be derived readily. The derivation also shows that weighted
least squares may be viewed as ordinary least squares of transformed variables. The generalized
multiple regression model in (11.1) may be expressed as follows in matrix form:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{11.21}$$

where:

$$\mathbf{E}\{\boldsymbol{\varepsilon}\} = \mathbf{0}$$

$$\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\} = \mathbf{W}^{-1}$$

Note that the variance-covariance matrix of the error terms in (11.2) is the inverse of the weight matrix
defined in (11.7).





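We now define a diagonal matrix containing the square roots of the weights $w_i$ and denote it
by $\mathbf{W}^{1/2}$:

$$\underset{n \times n}{\mathbf{W}^{1/2}} =
\begin{bmatrix}
\sqrt{w_1} & 0 & \cdots & 0 \\
0 & \sqrt{w_2} & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \sqrt{w_n}
\end{bmatrix} \tag{11.22}$$

Note that $\mathbf{W}^{1/2}$ is symmetric and that $\mathbf{W}^{1/2}\mathbf{W}^{1/2} = \mathbf{W}$.
The latter relation also holds for the corresponding inverse matrices:
$\mathbf{W}^{-1/2}\mathbf{W}^{-1/2} = \mathbf{W}^{-1}$.

We premultiply the terms on both sides of regression model (11.21) by $\mathbf{W}^{1/2}$ and obtain:

$$\mathbf{W}^{1/2}\mathbf{Y} = \mathbf{W}^{1/2}\mathbf{X}\boldsymbol{\beta} + \mathbf{W}^{1/2}\boldsymbol{\varepsilon} \tag{11.23}$$

which can be expressed as:

$$\mathbf{Y}_w = \mathbf{X}_w\boldsymbol{\beta} + \boldsymbol{\varepsilon}_w \tag{11.23a}$$

where:

$$\mathbf{Y}_w = \mathbf{W}^{1/2}\mathbf{Y} \qquad \mathbf{X}_w = \mathbf{W}^{1/2}\mathbf{X} \qquad \boldsymbol{\varepsilon}_w = \mathbf{W}^{1/2}\boldsymbol{\varepsilon} \tag{11.23b}$$

By (5.45) and (5.46), we obtain:

$$\mathbf{E}\{\boldsymbol{\varepsilon}_w\} = \mathbf{W}^{1/2}\mathbf{E}\{\boldsymbol{\varepsilon}\} = \mathbf{W}^{1/2}\mathbf{0} = \mathbf{0} \tag{11.24a}$$

$$\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}_w\} = \mathbf{W}^{1/2}\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\}\mathbf{W}^{1/2} = \mathbf{W}^{1/2}\mathbf{W}^{-1}\mathbf{W}^{1/2} = \mathbf{W}^{1/2}\mathbf{W}^{-1/2}\mathbf{W}^{-1/2}\mathbf{W}^{1/2} = \mathbf{I} \tag{11.24b}$$

Thus, regression model (11.23a) involves independent error terms with mean zero and constant
variance $\sigma^2 \equiv 1$. We can therefore apply standard regression procedures to this
transformed regression model.

For example, the ordinary least squares estimators of the regression coefficients in (6.25) here
become:

$$\mathbf{b}_w = (\mathbf{X}_w'\mathbf{X}_w)^{-1}\mathbf{X}_w'\mathbf{Y}_w$$

Using the definitions in (11.23b), we obtain the result for weighted least squares given in (11.9):

$$\mathbf{b}_w = [(\mathbf{W}^{1/2}\mathbf{X})'\mathbf{W}^{1/2}\mathbf{X}]^{-1}(\mathbf{W}^{1/2}\mathbf{X})'\mathbf{W}^{1/2}\mathbf{Y} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y}$$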


6. Weighted least squares is a special case of generalized least squares where the error terms not
only may have different variances but pairs of error terms may also be correlated.

7. For simple linear regression, the weighted least squares normal equations in (11.8) become:

$$\begin{aligned}
\sum w_i Y_i &= b_{w0}\sum w_i + b_{w1}\sum w_i X_i \\
\sum w_i X_i Y_i &= b_{w0}\sum w_i X_i + b_{w1}\sum w_i X_i^2
\end{aligned} \tag{11.25}$$

and the weighted least squares estimators $b_{w0}$ and $b_{w1}$ in (11.9) are:

$$b_{w1} = \frac{\sum w_i X_i Y_i - \dfrac{\sum w_i X_i \sum w_i Y_i}{\sum w_i}}{\sum w_i X_i^2 - \dfrac{(\sum w_i X_i)^2}{\sum w_i}} \tag{11.26a}$$

$$b_{w0} = \frac{\sum w_i Y_i - b_{w1}\sum w_i X_i}{\sum w_i} \tag{11.26b}$$

Note that if all weights are equal so $w_i$ is identically equal to a constant, the normal equations (11.25)
for weighted least squares reduce to the ones for unweighted least squares in (1.9) and the weighted
least squares estimators (11.26) reduce to the ones for unweighted least squares in (1.10).
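
Comments 5 and 7 can be verified numerically. The sketch below, on simulated data with hypothetical known variances, computes the weighted least squares coefficients three ways: directly from (11.9), by ordinary least squares on the transformed variables of (11.23b), and from the closed-form expressions (11.26a) and (11.26b). All three should agree up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 25)
X = np.column_stack([np.ones_like(x), x])
sigma2 = 0.25 * x**2                       # hypothetical known error variances
Y = 4.0 - 1.5 * x + rng.normal(0.0, np.sqrt(sigma2))
w = 1.0 / sigma2

# (11.9) directly
b_direct = np.linalg.solve((X.T * w) @ X, (X.T * w) @ Y)

# Comment 5: OLS on the transformed variables W^{1/2}Y and W^{1/2}X, (11.23b)
sqrt_w = np.sqrt(w)
b_transformed = np.linalg.lstsq(X * sqrt_w[:, None], Y * sqrt_w, rcond=None)[0]

# Comment 7: closed form for simple linear regression, (11.26a)-(11.26b)
sw, swx, swy = w.sum(), (w * x).sum(), (w * Y).sum()
b_w1 = ((w * x * Y).sum() - swx * swy / sw) / ((w * x**2).sum() - swx**2 / sw)
b_w0 = (swy - b_w1 * swx) / sw

print(b_direct, b_transformed, (b_w0, b_w1))
```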


11.2 Multicollinearity Remedial Measures — Ridge Regression 

We consider first some remedial measures for serious multicollinearity that can be
implemented with ordinary least squares, and then take up ridge regression, a method of
overcoming serious multicollinearity problems by modifying the method of least squares.

Some Remedial Measures 

1. As we saw in Chapter 7, the presence of serious multicollinearity often does not affect 
the usefulness of the fitted model for estimating mean responses or making predictions, 
provided that the values of the predictor variables for which inferences are to be made 
follow the same multicollinearity pattern as the data on which the regression model is based. 
Hence, one remedial measure is to restrict the use of the fitted regression model to inferences 
for values of the predictor variables that follow the same pattern of multicollinearity. 

2. In polynomial regression models, as we noted in Chapter 7, use of centered data 
for the predictor variable(s) serves to reduce the multicollinearity among the first-order, 
second-order, and higher-order terms for any given predictor variable. 

3. One or several predictor variables may be dropped from the model in order to lessen 
the multicollinearity and thereby reduce the standard errors of the estimated regression 
coefficients of the predictor variables remaining in the model. This remedial measure has two 
important limitations. First, no direct information is obtained about the dropped predictor 
variables. Second, the magnitudes of the regression coefficients for the predictor variables 
remaining in the model are affected by the correlated predictor variables not included in the 
model. 

4. Sometimes it is possible to add some cases that break the pattern of multicollinearity.
Often, however, this option is not available. In business and economics, for instance, many
predictor variables cannot be controlled, so that new cases will tend to show the same
intercorrelation patterns as the earlier ones.

5. In some economic studies, it is possible to estimate the regression coefficients for
different predictor variables from different sets of data and thereby avoid the problems of
multicollinearity. Demand studies, for instance, may use both cross-section and time series
data to this end. Suppose the predictor variables in a demand study are price and income,
and the relation to be estimated is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \tag{11.27}$$

where $Y$ is demand, $X_1$ is income, and $X_2$ is price. The income coefficient $\beta_1$ may
then be estimated from cross-section data. The demand variable $Y$ is thereupon adjusted:

$$Y_i' = Y_i - b_1 X_{i1} \tag{11.28}$$

Finally, the price coefficient $\beta_2$ is estimated by regressing the adjusted demand variable
$Y'$ on $X_2$.

6. Another remedial measure for multicollinearity that can be used with ordinary least
squares is to form one or several composite indexes based on the highly correlated predictor
variables, an index being a linear combination of the correlated predictor variables. The
methodology of principal components provides composite indexes that are uncorrelated.
Often, a few of these composite indexes capture much of the information contained in
the predictor variables. These few uncorrelated composite indexes are then used in the
regression analysis as predictor variables instead of the original highly correlated predictor
variables. A limitation of principal components regression, also called latent root regression,
is that it may be difficult to attach concrete meanings to the indexes.

More information about these remedial approaches as well as about Bayesian regression, 
where prior information about the regression coefficients is incorporated into the estimation 
procedure, may be obtained from specialized works such as Reference 11.3. 

Ridge Regression 

Biased Estimation. Ridge regression is one of several methods that have been proposed to
remedy multicollinearity problems by modifying the method of least squares to allow biased
estimators of the regression coefficients. When an estimator has only a small bias and is
substantially more precise than an unbiased estimator, it may well be the preferred estimator
since it will have a larger probability of being close to the true parameter value. Figure 11.2
illustrates this situation. Estimator $b$ is unbiased but imprecise, whereas estimator $b^R$ is
much more precise but has a small bias. The probability that $b^R$ falls near the true value
$\beta$ is much greater than that for the unbiased estimator $b$.


FIGURE 11.2 Biased Estimator with Small Variance May Be Preferable to Unbiased Estimator
with Large Variance.


A measure of the combined effect of bias and sampling variation is the mean squared
error, a concept that we encountered in Chapter 9 in connection with the $C_p$ criterion.
Here, the mean squared error is the expected value of the squared deviation of the biased
estimator $b^R$ from the true parameter $\beta$. As before, this expected value is the sum of the
variance of the estimator and the squared bias:

$$E\{b^R - \beta\}^2 = \sigma^2\{b^R\} + (E\{b^R\} - \beta)^2 \tag{11.29}$$

Note that if the estimator is unbiased, the mean squared error is identical to the variance of
the estimator.

Ridge Estimators. For ordinary least squares, the normal equations are given by (6.24):

$$(\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{Y} \tag{11.30}$$

When all variables are transformed by the correlation transformation (7.44), the transformed
regression model is given by (7.45):

$$Y_i^* = \beta_1^* X_{i1}^* + \beta_2^* X_{i2}^* + \cdots + \beta_{p-1}^* X_{i,p-1}^* + \varepsilon_i^* \tag{11.31}$$

and the least squares normal equations are given by (7.52a):

$$\mathbf{r}_{XX}\mathbf{b} = \mathbf{r}_{YX} \tag{11.32}$$

where $\mathbf{r}_{XX}$ is the correlation matrix of the X variables defined in (7.47) and
$\mathbf{r}_{YX}$ is the vector of coefficients of simple correlation between Y and each X variable
defined in (7.48).

The ridge standardized regression estimators are obtained by introducing into the least
squares normal equations (11.32) a biasing constant $c \ge 0$, in the following form:

$$(\mathbf{r}_{XX} + c\mathbf{I})\mathbf{b}^R = \mathbf{r}_{YX} \tag{11.33}$$

where $\mathbf{b}^R$ is the vector of the standardized ridge regression coefficients $b_k^R$:

$$\underset{(p-1) \times 1}{\mathbf{b}^R} =
\begin{bmatrix}
b_1^R \\ b_2^R \\ \vdots \\ b_{p-1}^R
\end{bmatrix} \tag{11.33a}$$

and $\mathbf{I}$ is the $(p - 1) \times (p - 1)$ identity matrix. Solution of the normal equations
(11.33) yields the ridge standardized regression coefficients:

$$\mathbf{b}^R = (\mathbf{r}_{XX} + c\mathbf{I})^{-1}\mathbf{r}_{YX} \tag{11.34}$$

The constant $c$ reflects the amount of bias in the estimators. When $c = 0$, (11.34) reduces to
the ordinary least squares regression coefficients in standardized form, as given in (7.52b).
When $c > 0$, the ridge regression coefficients are biased but tend to be more stable (i.e.,
less variable) than ordinary least squares estimators.
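
A minimal sketch of (11.34) follows. It assumes the correlation transformation scales each centered variable by $\sqrt{n-1}$ times its sample standard deviation, so that the cross-product matrix of the transformed X variables is their correlation matrix. The function names and the simulated, nearly collinear data are illustrative assumptions.

```python
import numpy as np

def correlation_transform(X, Y):
    """Center each variable, then scale by sqrt(n - 1) times its sample standard deviation."""
    n = len(Y)
    Xs = (X - X.mean(axis=0)) / (np.sqrt(n - 1) * X.std(axis=0, ddof=1))
    Ys = (Y - Y.mean()) / (np.sqrt(n - 1) * Y.std(ddof=1))
    return Xs, Ys

def ridge_standardized(X, Y, c):
    """Ridge standardized coefficients b^R = (r_XX + cI)^{-1} r_YX, as in (11.34)."""
    Xs, Ys = correlation_transform(X, Y)
    r_xx = Xs.T @ Xs                       # correlation matrix of the X variables
    r_yx = Xs.T @ Ys                       # correlations between Y and each X
    return np.linalg.solve(r_xx + c * np.eye(X.shape[1]), r_yx)

# Hypothetical data: the first two predictors are nearly collinear.
rng = np.random.default_rng(4)
n = 20
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n),
                     z + 0.05 * rng.normal(size=n),
                     rng.normal(size=n)])
Y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=n)

b_ols = ridge_standardized(X, Y, 0.0)      # c = 0: standardized OLS coefficients
b_ridge = ridge_standardized(X, Y, 0.02)
print(b_ols, b_ridge)
```

Here `X` holds only the p - 1 predictor columns; no intercept column is needed because the transformed variables are centered.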


Choice of Biasing Constant c. It can be shown that the bias component of the total
mean squared error of the ridge regression estimator $\mathbf{b}^R$ increases as $c$ gets larger
(with all $b_k^R$ tending toward zero) while the variance component becomes smaller. It can further
be shown that there always exists some value $c$ for which the ridge regression estimator
$\mathbf{b}^R$ has a smaller total mean squared error than the ordinary least squares estimator
$\mathbf{b}$. The difficulty is that the optimum value of $c$ varies from one application to another
and is unknown.

A commonly used method of determining the biasing constant $c$ is based on the ridge
trace and the variance inflation factors $(VIF)_k$ in (10.41). The ridge trace is a simultaneous
plot of the values of the $p - 1$ estimated ridge standardized regression coefficients for
different values of $c$, usually between 0 and 1. Extensive experience has indicated that the
estimated regression coefficients $b_k^R$ may fluctuate widely as $c$ is changed slightly from 0,
and some may even change signs. Gradually, however, these wide fluctuations cease and the
magnitudes of the regression coefficients tend to move slowly toward zero as $c$ is increased
further. At the same time, the values of $(VIF)_k$ tend to fall rapidly as $c$ is changed from 0, and
gradually the $(VIF)_k$ values also tend to change only moderately as $c$ is increased further.
One therefore examines the ridge trace and the VIF values and chooses the smallest value
of $c$ where it is deemed that the regression coefficients first become stable in the ridge trace
and the VIF values have become sufficiently small. The choice is thus a judgmental one.
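
The quantities examined in this procedure can be tabulated for a grid of $c$ values. The sketch below uses a made-up correlation structure for two nearly collinear predictors (not the body fat data), computes the ridge coefficients from (11.34), and computes the VIF values as the diagonal of $(\mathbf{r}_{XX} + c\mathbf{I})^{-1}\mathbf{r}_{XX}(\mathbf{r}_{XX} + c\mathbf{I})^{-1}$, the matrix expression this section gives for ridge VIF values in (11.36). Function names are illustrative.

```python
import numpy as np

# Hypothetical correlations for two nearly collinear predictors (not the body fat data).
r_xx = np.array([[1.00, 0.98],
                 [0.98, 1.00]])
r_yx = np.array([0.80, 0.75])

def ridge_coefficients(c):
    """Ridge standardized coefficients from (11.34)."""
    return np.linalg.solve(r_xx + c * np.eye(2), r_yx)

def ridge_vif(c):
    """VIF values for the ridge coefficients, diagonal of the matrix in (11.36)."""
    A = np.linalg.inv(r_xx + c * np.eye(2))
    return np.diag(A @ r_xx @ A)

# Tabulate the ridge trace and VIF values over a grid of biasing constants.
for c in [0.0, 0.002, 0.01, 0.02, 0.05, 0.1, 0.5, 1.0]:
    print(f"c={c:<6} b_R={np.round(ridge_coefficients(c), 4)} VIF={np.round(ridge_vif(c), 2)}")
```

In this made-up example the second coefficient is negative at $c = 0$ and positive by $c = .05$, the kind of sign change the ridge trace is meant to reveal. Choosing $c$ from such a table remains a judgment call.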

Example

In the body fat example with three predictor variables in Table 7.1, we noted previously
several informal indications of severe multicollinearity in the data. Indeed, in the fitted
model with three predictor variables (Table 7.2d), the estimated regression coefficient $b_2$
is negative even though it was expected that amount of body fat is positively related to
thigh circumference. Ridge regression calculations were made for the body fat example
data in Table 7.1 (calculations not shown). The ridge standardized regression coefficients
for selected values of $c$ are presented in Table 11.2, and the variance inflation factors are
given in Table 11.3. The coefficients of multiple determination $R^2$ are also shown in the
latter table. Figure 11.3 presents the ridge trace of the estimated standardized regression
coefficients based on calculations for many more values of $c$ than those shown in Table 11.2.
To facilitate the analysis, the horizontal $c$ scale in Figure 11.3 is logarithmic.


TABLE 11.2 Ridge Estimated Standardized Regression Coefficients for Different Biasing
Constants c: Body Fat Example with Three Predictor Variables.

    c        $b_1^R$    $b_2^R$    $b_3^R$

   .000      4.264     -2.929     -1.561
   .002      1.441     -.4113     -.4813
   .004      1.006     -.0248     -.3149
   .006      .8300      .1314     -.2472
   .008      .7343      .2158     -.2103
   .010      .6742      .2684     -.1870
   .020      .5463      .3774     -.1369
   .030      .5004      .4134     -.1181
   .040      .4760      .4302     -.1076
   .050      .4605      .4392     -.1005
   .100      .4234      .4490     -.0812
   .500      .3377      .3791     -.0295
  1.000      .2798      .3101     -.0059


TABLE 11.3 VIF Values for Regression Coefficients and $R^2$ for Different Biasing
Constants c: Body Fat Example with Three Predictor Variables.

    c        $(VIF)_1$   $(VIF)_2$   $(VIF)_3$    $R^2$

   .000      708.84      564.34      104.61      .8014
   .002       50.56       40.45        8.28      .7901
   .004       16.98       13.73        3.36      .7864
   .006        8.50        6.98        2.19      .7847
   .008        5.15        4.30        1.62      .7838
   .010        3.49        2.98        1.38      .7832
   .020        1.10        1.08        1.01      .7818
   .030         .63         .70         .92      .7812
   .040         .45         .56         .88      .7808
   .050         .37         .49         .85      .7804
   .100         .25         .37         .76      .7784
   .500         .15         .21         .40      .7427
  1.000      (missing)      .14         .23      .6818



FIGURE 11.3 Ridge Trace of Estimated Standardized Regression Coefficients: Body Fat
Example with Three Predictor Variables.



Note the instability in Figure 11.3 of the regression coefficients for very small values
of $c$. The estimated regression coefficient $b_2^R$, in fact, changes signs. Also note the rapid
decrease in the VIF values in Table 11.3. It was decided to employ $c = .02$ here because for
this value of the biasing constant the ridge regression coefficients have VIF values near 1
and the estimated regression coefficients appear to have become reasonably stable. The
resulting fitted model for $c = .02$ is:

$$\hat{Y}^* = .5463X_1^* + .3774X_2^* - .1369X_3^*$$

Transforming back to the original variables by (7.53), we obtain:

$$\hat{Y} = -7.3978 + .5553X_1 + .3681X_2 - .1917X_3$$

where $\bar{Y} = 20.195$, $\bar{X}_1 = 25.305$, $\bar{X}_2 = 51.170$, $\bar{X}_3 = 27.620$,
$s_Y = 5.106$, $s_1 = 5.023$, $s_2 = 5.235$, and $s_3 = 3.647$.

The improper sign on the estimate for $\beta_2$ has now been eliminated, and the estimated
regression coefficients are more in line with prior expectations. The sum of the squared
residuals for the transformed variables, which increases with $c$, has only increased from
.1986 at $c = 0$ to .2182 at $c = .02$ while $R^2$ decreased from .8014 to .7818. These changes
are relatively modest. The estimated mean body fat when $X_{h1} = 25.0$, $X_{h2} = 50.0$, and
$X_{h3} = 29.0$ is 19.33 for the ridge regression at $c = .02$ compared to 19.19 utilizing the
ordinary least squares solution. Thus, the ridge solution at $c = .02$ appears to be quite
satisfactory here and a reasonable alternative to the ordinary least squares solution.
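
The back-transformation can be retraced from the quoted summary statistics. The sketch assumes the usual relations for the correlation transformation, $b_k = (s_Y/s_k)\,b_k^R$ and $b_0 = \bar{Y} - \sum b_k \bar{X}_k$; variable names are illustrative. Small differences in the last digit of the intercept arise from rounding of the slopes.

```python
import numpy as np

b_R = np.array([0.5463, 0.3774, -0.1369])    # ridge standardized coefficients at c = .02
s_Y = 5.106
s_X = np.array([5.023, 5.235, 3.647])
Y_bar = 20.195
X_bar = np.array([25.305, 51.170, 27.620])

b = (s_Y / s_X) * b_R                        # slopes on the original scale
b0 = Y_bar - b @ X_bar                       # intercept
Y_hat_h = b0 + b @ np.array([25.0, 50.0, 29.0])

print(np.round(b, 4))                        # approximately .5553, .3681, -.1917
print(round(b0, 2), round(Y_hat_h, 2))       # approximately -7.40 and 19.33
```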


Comments 

1. The normal equations (11.33) for the ridge estimators are as follows:

$$\begin{aligned}
(1 + c)b_1^R + r_{12}b_2^R + \cdots + r_{1,p-1}b_{p-1}^R &= r_{Y1} \\
r_{21}b_1^R + (1 + c)b_2^R + \cdots + r_{2,p-1}b_{p-1}^R &= r_{Y2} \\
&\;\;\vdots \\
r_{p-1,1}b_1^R + r_{p-1,2}b_2^R + \cdots + (1 + c)b_{p-1}^R &= r_{Y,p-1}
\end{aligned} \tag{11.35}$$

where $r_{ij}$ is the coefficient of simple correlation between the $i$th and $j$th X variables and
$r_{Yj}$ is the coefficient of simple correlation between the response variable Y and the $j$th X
variable.






2. VIF values for ridge regression coefficients $b_k^R$ are defined analogously to those for ordinary
least squares regression coefficients. Namely, the VIF value for $b_k^R$ measures how large is the
variance of $b_k^R$ relative to what the variance would be if the predictor variables were
uncorrelated. It can be shown that the VIF values for the ridge regression coefficients $b_k^R$ are
the diagonal elements of the following $(p - 1) \times (p - 1)$ matrix:

$$(\mathbf{r}_{XX} + c\mathbf{I})^{-1}\mathbf{r}_{XX}(\mathbf{r}_{XX} + c\mathbf{I})^{-1} \tag{11.36}$$

3. The coefficient of multiple determination $R^2$, which for ordinary least squares is given in (6.40):

$$R^2 = 1 - \frac{SSE}{SSTO} \tag{11.37}$$

can be defined analogously for ridge regression. A simplification occurs, however, because the total
sum of squares for the correlation-transformed dependent variable $Y^*$ in (7.44a) is:

$$SSTO_R = \sum (Y_i^* - \bar{Y}^*)^2 = 1 \tag{11.38}$$

The fitted values with ridge regression are:

$$\hat{Y}_i^* = b_1^R X_{i1}^* + \cdots + b_{p-1}^R X_{i,p-1}^* \tag{11.39}$$

where the $X_{ik}^*$ are the X variables transformed according to the correlation transformation (7.44b).
The error sum of squares, as usual, is:

$$SSE_R = \sum (Y_i^* - \hat{Y}_i^*)^2 \tag{11.40}$$

where $\hat{Y}_i^*$ is given in (11.39). $R^2$ for ridge regression then becomes:

$$R_R^2 = 1 - SSE_R \tag{11.41}$$

4. Ridge regression estimates can be obtained by the method of penalized least squares. The
penalized least squares criterion combines the usual sum of squared errors with a penalty for large
regression coefficients:

$$Q = \sum_{i=1}^{n} \left[ Y_i^* - (\beta_1^* X_{i1}^* + \cdots + \beta_{p-1}^* X_{i,p-1}^*) \right]^2 + c \sum_{j=1}^{p-1} (\beta_j^*)^2$$

The penalty is a biasing constant, c, times the sum of squares of the regression coefficients. Large
absolute regression parameters lead to a large penalty; thus, it can be seen that for c > 0 the "best"
coefficients generally will be smaller in absolute magnitude than the ordinary least squares estimates.
For this reason, ridge estimators are sometimes referred to as shrinkage estimators.

5. Ridge regression estimates tend to be stable in the sense that they are usually little affected by
small changes in the data on which the fitted regression is based. In contrast, ordinary least squares
estimates may be highly unstable under these conditions when the predictor variables are highly
multicollinear. Predictions of new observations made from ridge estimated regression functions tend
to be more precise than predictions made from ordinary least squares regression functions when
the predictor variables are correlated and the new observations follow the same multicollinearity
pattern (see, for instance, Reference 11.4). The prediction precision advantage with ridge regression
is especially great when the intercorrelations among the predictor variables are high.

6. Ridge estimated regression functions at times will provide good estimates of mean responses
or predictions of new observations for levels of the predictor variables outside the region of the
observations on which the regression function is based. In contrast, estimated regression functions based
on ordinary least squares may perform quite poorly in such circumstances. Of course, any estimation
or prediction well outside the region of the observations should always be made with great caution.





7. A major limitation of ridge regression is that ordinary inference procedures are not applicable 
and exact distributional properties are not known. Bootstrapping, a computer-intensive procedure to 
be discussed in Section 11.5, can be employed to evaluate the precision of ridge regression coefficients. 
Another limitation of ridge regression is that the choice of the biasing constant c is a judgmental one. 
Although a variety of formal methods have been developed for making this choice, these have their 
own limitations. 

8. The ridge regression procedures have been generalized to allow for differing biasing constants
for the different estimated regression coefficients; see, for instance, Reference 11.3.

9. Ridge regression can be used to help in reducing the number of potential predictor variables in
exploratory observational studies by analyzing the ridge trace. Variables whose ridge trace is unstable,
with the coefficient tending toward the value of zero, are dropped with this approach. Also, variables
whose ridge trace is stable but at a very small value are dropped. Finally, variables with unstable ridge
traces that do not tend toward zero are considered as candidates for dropping.
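
Comments 1, 2, and 4 can be illustrated numerically. The following is a minimal Python sketch, assuming NumPy is available; the function names and the two-predictor correlation values are hypothetical and are not taken from the body fat example. It solves the matrix form of the normal equations (11.35), $(\mathbf{r}_{XX} + c\mathbf{I})\mathbf{b}^R = \mathbf{r}_{YX}$, evaluates the VIF matrix (11.36), and evaluates the penalized criterion of Comment 4 using $SSTO_R = 1$ from (11.38).

```python
import numpy as np

def ridge_standardized(r_xx, r_yx, c):
    """Solve (r_XX + cI) b = r_YX, the matrix form of normal equations (11.35)."""
    k = len(r_yx)
    return np.linalg.solve(r_xx + c * np.eye(k), r_yx)

def ridge_vif(r_xx, c):
    """Diagonal of (r_XX + cI)^-1 r_XX (r_XX + cI)^-1, as in (11.36)."""
    inv = np.linalg.inv(r_xx + c * np.eye(r_xx.shape[0]))
    return np.diag(inv @ r_xx @ inv)

def penalized_criterion(b, r_xx, r_yx, c):
    """SSE + c * sum(b_j^2) for correlation-transformed data.

    Because SSTO = 1 in (11.38), SSE = 1 - 2 b'r_YX + b'r_XX b.
    """
    sse = 1.0 - 2.0 * (b @ r_yx) + b @ r_xx @ b
    return sse + c * (b @ b)

# Hypothetical correlations for two highly correlated predictors
# (illustrative values only, not from the text's examples).
r_xx = np.array([[1.00, 0.95],
                 [0.95, 1.00]])
r_yx = np.array([0.80, 0.70])

b_ols = ridge_standardized(r_xx, r_yx, 0.0)    # c = 0 reproduces least squares
b_ridge = ridge_standardized(r_xx, r_yx, 0.1)

for c in (0.0, 0.02, 0.1):
    b = ridge_standardized(r_xx, r_yx, c)
    r2 = 1.0 - penalized_criterion(b, r_xx, r_yx, 0.0)   # R^2 = 1 - SSE, as in (11.41)
    print(c, b.round(4), ridge_vif(r_xx, c).round(2), round(float(r2), 4))
```

With these made-up correlations the least squares coefficient of the second predictor is negative even though its simple correlation with Y is positive, and the ridge coefficients are smaller in magnitude, which is the shrinkage behavior described in Comment 4.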

11.3 Remedial Measures for Influential Cases—Robust Regression 

We noted in Chapter 10 that the hat matrix and studentized deleted residuals are valuable 
tools for identifying cases that are outlying with respect to the X and Y variables. In 
addition, we considered there how to measure the influence of these outlying cases on 
the fitted values and estimated regression coefficients by means of the DFFITS, Cook’s
distance, and DFBETAS measures. The reason for our concern with outlying cases is that 
the method of least squares is particularly susceptible to these cases, resulting sometimes 
in a seriously distorted fitted model for the remaining cases. A crucial question that arises 
now is how to handle highly influential cases. 

A first step is to examine whether an outlying case is the result of a recording error, 
breakdown of a measurement instrument, or the like. For instance, in a study of the waiting 
time in a telephone reservation system, one waiting time was recorded as 1,000 rings. This 
observation was so extreme and unrealistic that it was clearly erroneous. If erroneous data 
can be corrected, this should be done. Often, however, erroneous data cannot be corrected 
later on and should be discarded. Many times, unfortunately, it is not possible after the
data have been obtained to tell for certain whether the observations for an outlying case are 
erroneous. Such cases should usually not be discarded. 

If an outlying influential case is not clearly erroneous, the next step should be to examine 
the adequacy of the model. Scientists frequently have primary interest in the outlying cases 
because they deviate from the currently accepted model. Examination of these outlying 
cases may provide important clues as to how the model needs to be modified. In a study 
of the yield of a process, a first-order model was fitted for the two important factors under 
consideration because previous studies had not found any interaction effects between these 
factors on the yield. One case in the current study was outlying and highly influential, 
with extremely high yield; it corresponded to unusually high levels of the two factors. The 
tentative conclusion drawn was that an interaction effect is present; this was subsequently 
confirmed in a follow-up study. The improved model, resulting from the outlying case, led
to greatly improved process productivity. 

Outlying cases may also lead to the finding of other types of model inadequacies, such as 
the omission of an important variable or the choice of an incorrect functional form (e.g., a 
quadratic function instead of an exponential function). The analysis of outlying influential
cases can frequently lead to valuable insights for strengthening the model such that the
outlying case is no longer an outlier but is accounted for by the model. 

Discarding of outlying influential cases that are not clearly erroneous and that cannot be 
accounted for by model improvements should be done only rarely, such as when the model
is not intended to cover the special circumstances related to the outlying cases. For example,
a few cases in an industrial study were outlying and highly influential. These cases occurred 
early in the study, when the plant was in transition from one process to the new one under 
study. Discarding of these early cases was deemed to be reasonable since the model was 
intended for use after the new process had stabilized. 

An alternative to discarding outlying cases that is less severe is to dampen the influence 
of these cases. That is the purpose of robust regression. 


Robust Regression 

Robust regression procedures dampen the influence of outlying cases, as compared to 
ordinary least squares estimation, in an effort to provide a better fit for the majority of 
cases. They are useful when a known, smooth regression function is to be fitted to data that 
are "noisy," with a number of outlying cases, so that the assumption of a normal distribution
for the error terms is not appropriate. Robust regression procedures are also useful when
automated regression analysis is required. For example, a complex measurement instrument
used for internal medical examinations must be calibrated for each use. There is no time for a
thorough identification of outlying cases and an analysis of their influence, nor for a careful
consideration of remedial measures. Instead, an automated regression calibration must be
used. Robust regression procedures will automatically guard against undue influence of 
outlying cases in this situation. 

Numerous robust regression procedures have been developed. They are described in 
specialized texts, such as References 11.5 and 11.6. We mention briefly a few of these
procedures and then describe in more detail one commonly used procedure based on iteratively
reweighted least squares. 

LAR or LAD Regression. Least absolute residuals (LAR) or least absolute deviations
(LAD) regression, also called minimum $L_1$-norm regression, is one of the most widely used
robust regression procedures. It is insensitive to both outlying data values and inadequacies
of the model employed. The method of least absolute residuals estimates the regression
coefficients by minimizing the sum of the absolute deviations of the Y observations from
their means. The criterion to be minimized, denoted by $L_1$, is:

$$L_1 = \sum_{i=1}^{n} \left| Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}) \right|$$   (11.42)

Since absolute deviations rather than squared ones are involved here, the LAR method
places less emphasis on outlying observations than does the method of least squares.

The estimated LAR regression coefficients can be obtained by linear programming
techniques. Details about computational aspects may be found in specialized texts, such as
Reference 11.7. The LAR fitted regression model differs from the least squares fitted model
in that the residuals ordinarily will not sum to zero. Also, the solution for the estimated
regression coefficients with the method of least absolute residuals may not be unique.





IRLS Robust Regression. Iteratively reweighted least squares (IRLS) robust regression 
uses the weighted least squares procedures discussed in Section 11.1 to dampen the influence 
of outlying observations. Instead of weights based on the error variances, IRLS robust 
regression uses weights based on how far outlying a case is, as measured by the residual for 
that case. The weights are revised with each iteration until a robust fit has been obtained. 
We shall discuss this procedure in more detail shortly. 

LMS Regression. Least median of squares (LMS) regression replaces the sum of squared 
deviations in ordinary least squares by the median of the squared deviations, which is a robust 
estimator of location. The criterion for this procedure is to minimize the median squared 
deviation: 

$$\text{median}\left\{ \left[ Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}) \right]^2 \right\}$$   (11.43)

with respect to the regression coefficients. Thus, this procedure leads to estimated regression
coefficients $b_0, b_1, \ldots, b_{p-1}$ that minimize the median of the squared residuals.

Other Robust Regression Procedures. There are many other robust regression
procedures. Some involve trimming one or several of the extreme squared deviations before
applying the least squares criterion; others are based on ranks. Many of the robust regression
procedures require extensive computing.
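
To see how the criteria differ, the sketch below evaluates the least squares criterion, the LAR criterion (11.42), and the LMS criterion (11.43) for two candidate lines on a small made-up data set with one outlier. It only evaluates the criteria; it does not minimize (11.42) or (11.43), which requires the specialized methods cited above. All names and data values are hypothetical.

```python
from statistics import median

def residuals(b0, b1, x, y):
    """Residuals Y_i - (b0 + b1 X_i) for a simple linear regression function."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def least_squares_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def ls_criterion(e):
    return sum(ei ** 2 for ei in e)      # sum of squared deviations

def lar_criterion(e):
    return sum(abs(ei) for ei in e)      # L1 criterion, as in (11.42)

def lms_criterion(e):
    return median(ei ** 2 for ei in e)   # median squared deviation, as in (11.43)

# Hypothetical data: five points on the line Y = 2X plus one outlying Y value.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 40]

e_bulk = residuals(0.0, 2.0, x, y)                    # line through the majority
e_ols = residuals(*least_squares_line(x, y), x, y)    # least squares line

for name, e in (("bulk line", e_bulk), ("OLS line", e_ols)):
    print(name, ls_criterion(e), lar_criterion(e), lms_criterion(e))
```

The least squares criterion prefers the line that is pulled toward the outlier, while both the LAR and LMS criteria prefer the line through the majority of cases.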


IRLS Robust Regression 

Iteratively reweighted least squares was encountered in Section 11.1 as a remedial measure
for unequal error variances in connection with the obtaining of weights from an estimated 
variance or standard deviation function. For robust regression, weighted least squares is 
used to reduce the influence of outlying cases by employing weights that vary inversely 
with the size of the residual. Outlying cases that have large residuals are thereby given 
smaller weights. The weights are revised as each iteration yields new residuals until the 
estimation process stabilizes. A summary of the steps follows: 

1. Choose a weight function for weighting the cases. 

2. Obtain starting weights for all cases. 

3. Use the starting weights in weighted least squares and obtain the residuals from the fitted 
regression function. 

4. Use the residuals in step 3 to obtain revised weights. 

5. Continue the iterations until convergence is obtained. 


We now discuss each of the steps in IRLS robust regression. 


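Weight Function. Many weight functions have been proposed for dampening the
influence of outlying cases. Two widely used weight functions are the Huber and bisquare weight
functions:

$$\text{Huber:}\quad w = \begin{cases} 1 & |u| \le 1.345 \\[4pt] \dfrac{1.345}{|u|} & |u| > 1.345 \end{cases}$$   (11.44)

$$\text{Bisquare:}\quad w = \begin{cases} \left[ 1 - \left( \dfrac{u}{4.685} \right)^2 \right]^2 & |u| \le 4.685 \\[4pt] 0 & |u| > 4.685 \end{cases}$$   (11.45)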





FIGURE 11.4 Two Weight Functions Used in IRLS Robust Regression. Panels: (a) Huber Weight Function; (b) Bisquare Weight Function.


As before, w denotes the weight, and u denotes the scaled residual to be defined shortly.
The constant 1.345 in the Huber weight function and the constant 4.685 in the bisquare
weight function are called tuning constants. They were chosen to make the IRLS robust
procedure 95 percent efficient for data generated by the normal error regression model (6.7).
Figure 11.4 shows graphs of the two weight functions. Note how the weight w according
to each weight function declines as the absolute scaled residual gets larger, and that each
weight function is symmetric around u = 0. Also note that the Huber weight function does
not reduce the weight of a case from 1.0 until the absolute scaled residual exceeds 1.345
and that all cases receive some positive weight, no matter how large the absolute scaled
residual. In contrast, the bisquare weight function reduces the weights of all cases from
1.0 (unless the residual is zero). In addition, the bisquare weight function gives weight 0
to all cases whose absolute scaled residual exceeds 4.685, thereby entirely excluding these
extreme cases.
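
The two weight functions (11.44) and (11.45) translate directly into code. The following is a minimal Python sketch; the function names and the choice of sample values for u are illustrative assumptions.

```python
def huber_weight(u, k=1.345):
    """Huber weight (11.44): 1 inside the tuning constant, k/|u| outside."""
    return 1.0 if abs(u) <= k else k / abs(u)

def bisquare_weight(u, k=4.685):
    """Bisquare weight (11.45): smooth decline to 0 at the tuning constant."""
    return (1.0 - (u / k) ** 2) ** 2 if abs(u) <= k else 0.0

# Weights for a few scaled residuals (values chosen only for illustration).
for u in (0.0, 1.0, 1.345, 3.0, 4.685, 6.0):
    print(u, round(huber_weight(u), 4), round(bisquare_weight(u), 4))
```

Both functions depend on u only through its absolute value, which is the symmetry around u = 0 noted above.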


Starting Values. Calculations with some of the weight functions are very sensitive to the
starting values; with others, this is less of a problem. When the Huber weight function is
employed, the initial residuals may be those obtained from an ordinary least squares fit.
The bisquare function calculations, on the other hand, are more sensitive to the starting
values. To obtain good starting values for the bisquare weight function, the Huber weight
function is often used to obtain an initial robust regression fit, and the residuals for this fit
are then employed as starting values for several iterations with the bisquare weight function.
Alternatively, least absolute residuals regression in (11.42) may be used to obtain starting
residuals when the bisquare weight function is used.

Scaled Residuals. The weight functions (11.44) and (11.45) are each designed to be used
with scaled residuals. The semistudentized residuals in (3.5) are scaled residuals and could
be employed. However, in the presence of outlying observations, $\sqrt{MSE}$ is not a resistant
estimator of the error term standard deviation $\sigma$; the magnitude of $\sqrt{MSE}$ can be greatly
influenced by one or several outlying observations. Also, $\sqrt{MSE}$ is not a robust estimator
of $\sigma$ when the distribution of the error terms is far from normal. Instead, the resistant and
robust median absolute deviation (MAD) estimator is often employed:

$$MAD = \frac{1}{.6745}\,\text{median}\{\,|e_i - \text{median}\{e_i\}|\,\}$$   (11.46)

The constant .6745 provides an unbiased estimate of $\sigma$ for independent observations from
a normal distribution. Here, it serves to provide an estimate that is approximately unbiased.





The scaled residual $u_i$ based on (11.46) then is:

$$u_i = \frac{e_i}{MAD}$$   (11.47)
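
The MAD estimator (11.46) and the scaled residuals (11.47) can be computed with the standard library alone. This is a minimal sketch; the function names and the residual values are hypothetical, chosen so that one residual is far from the rest.

```python
from statistics import median

def mad_scale(e):
    """MAD estimator (11.46): median absolute deviation from the median, divided by .6745."""
    m = median(e)
    return median(abs(ei - m) for ei in e) / 0.6745

def scaled_residuals(e):
    """Scaled residuals u_i = e_i / MAD, as in (11.47)."""
    s = mad_scale(e)
    return [ei / s for ei in e]

# Hypothetical residuals with one outlying case.
e = [-2.0, -1.0, 0.0, 1.0, 50.0]
print(mad_scale(e))
print(scaled_residuals(e))
```

Replacing the outlying residual 50.0 by any other large value leaves the MAD estimate unchanged here, which is the resistance property that motivates its use in place of $\sqrt{MSE}$.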


Number of Iterations. The iterative process of obtaining a new fit, new residuals and
thereby new weights, and then refitting with the new weights continues until the process
converges. Convergence can be measured by observing whether the weights change
relatively little, whether the residuals change relatively little, whether the estimated regression
coefficients change relatively little, or whether the fitted values change relatively little.
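
The five steps summarized earlier can be put together as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes NumPy, uses the Huber weight function (11.44) with MAD scaling (11.46), starts from an ordinary least squares fit, and declares convergence when the coefficients change relatively little. The function names, the tolerance, and the data are hypothetical.

```python
import numpy as np

def huber_weights(u, k=1.345):
    """Vectorized Huber weights (11.44)."""
    a = np.abs(u)
    return np.where(a <= k, 1.0, k / np.maximum(a, k))

def mad_scale(e):
    """MAD estimator (11.46)."""
    return np.median(np.abs(e - np.median(e))) / 0.6745

def irls_huber(X, y, tol=1e-8, max_iter=100):
    """Iteratively reweighted least squares with Huber weights.

    X must already contain a column of ones for the intercept.
    Returns the coefficient vector and the last set of weights used.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.linalg.lstsq(X, y, rcond=None)[0]     # starting fit: ordinary least squares
    w = np.ones(len(y))
    for _ in range(max_iter):
        e = y - X @ b                            # residuals from the current fit
        s = mad_scale(e)
        if s == 0.0:                             # degenerate scale; stop rather than divide by 0
            break
        w = huber_weights(e / s)                 # revised weights from scaled residuals
        sw = np.sqrt(w)
        # Weighted least squares via least squares on sqrt(w)-scaled rows.
        b_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(b_new - b)) <= tol * (1.0 + np.max(np.abs(b)))
        b = b_new
        if converged:
            break
    return b, w

# Hypothetical data: Y = 1 + 2X plus small noise, with one outlying Y value.
x = np.arange(10.0)
noise = np.array([0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, 0.0, -0.05, 0.1])
y = 1.0 + 2.0 * x + noise
y[9] += 30.0

X = np.column_stack([np.ones_like(x), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b, w = irls_huber(X, y)
print("OLS:", b_ols.round(3), "IRLS:", b.round(3), "weight of outlier:", w[9].round(3))
```

The convergence rule here monitors the coefficients; monitoring the weights, residuals, or fitted values, as mentioned above, would be equally reasonable choices.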


Example 1: Mathematics Proficiency with One Predictor

The Educational Testing Service study America’s Smallest School: The Family (Ref. 11.8)
investigated the relation of educational achievement of students to their home
environment. Although earlier studies examined the relation of educational achievement to family
socioeconomic status (e.g., parents’ education, family income, parents’ occupation), this
study employed more direct measures of the home environment. Specifically, the relation
of educational achievement of eighth-grade students in mathematics to the following five
explanatory variables was investigated:

PARENTS ($X_1$): percentage of eighth-grade students with both parents living at home

HOMELIB ($X_2$): percentage of eighth-grade students with three or more types of
reading materials at home (books, encyclopedias, magazines, newspapers)

READING ($X_3$): percentage of eighth-grade students who read more than 10 pages
a day

TVWATCH ($X_4$): percentage of eighth-grade students who watch TV for six hours or
more per day

ABSENCES ($X_5$): percentage of eighth-grade students absent three days or more last
month


Data on average mathematics proficiency (MATHPROF) and the home environment
variables were obtained from the 1990 National Assessment of Educational Progress for 
37 states, the District of Columbia, Guam, and the Virgin Islands. A portion of the data is 
shown in Table 11.4. 

Our first example of robust regression using iteratively reweighted least squares involves 
only one predictor, HOMELIB (X 2 ). In this way, simple plots can be used to present the 
data and the fitted regression function. 

Figure 11.5a presents a scatter plot of the data, together with a plot of a first-order
(simple linear) regression model fit by ordinary least squares and a lowess smooth. The
lowess smooth suggests that the relationship between home reading resources and average
mathematics proficiency is curvilinear (possibly second order) for the majority of states,
but three points are clear outliers. The District of Columbia and the Virgin Islands are outliers
with respect to mathematics proficiency (Y), and Guam appears to be an outlier with respect
to both mathematics proficiency and available reading resources (X). Figure 11.5b presents
a plot against X of the residuals obtained from the fitted first-order model in Figure 11.5a.
This plot shows clearly the three outlying Y cases. Note also from the residual plot that
there is a group of six states with low reading resources levels, between 68 and 73, whose
average mathematics proficiency scores are all above the fitted regression line. This is
another indication that a second-order polynomial model may be appropriate.



FIGURE 11.5 Comparison of Lowess, Ordinary Least Squares Fits, and Robust Quadratic Fits: Mathematics Proficiency Example. Panels: (a) Lowess and Linear Regression Fits; (c) OLS Quadratic Fit; (e) Robust Quadratic Fit.


TABLE 11.4 Data Set: Mathematics Proficiency Example.

State            MATHPROF  PARENTS  HOMELIB  READING  TVWATCH  ABSENCES
                     Y       X1       X2       X3       X4        X5
Alabama             252      75       78       34        ?        18
Arizona             259      75       73       41        ?        26
Arkansas            256      77       77       28       20         ?
California          256      78       68       42       11         ?
...                 ...      ...      ...      ...      ...       ...
D.C.                231      47       76       24       33        37
...                 ...      ...      ...      ...      ...       ...
Guam                231      81       64       32        ?        28
...                 ...      ...      ...      ...      ...       ...
Texas               258      77       70       34        ?        18
Virgin Islands      218      63       76       23        ?        22
Virginia            264      78       82        ?        ?        24
West Virginia       256      82       80       36       16        25
Wisconsin           274      81       86       38        8        21
Wyoming             272      85       86       43        7        23

Source: ETS Policy Information Center, America’s Smallest School: The Family (Princeton, New Jersey: Educational Testing Service, 1992).

Second-order model (8.2):

$$Y_i = \beta_0 + \beta_2 x_{i2} + \beta_{22} x_{i2}^2 + \varepsilon_i$$   (11.48)

was next fit, again using ordinary least squares. Recall that this model requires calculation
of the centered predictor $x_{i2} = X_{i2} - \bar{X}_2$ and its square, $x_{i2}^2$. A plot of the fit of the
second-order model, superimposed on a scatter plot of the data, is shown in Figure 11.5c. Though
improved, the fit is again unsatisfactory: the six points that fell above the first-order fit are
still above the fitted second-order model. The regression line is clearly being influenced by
the three outliers identified above. The Cook’s distance measures for the second-order fit
are displayed in an index plot in Figure 11.5d. The plot confirms the influence of Guam and
the Virgin Islands.

In an effort to dampen the effect of the three outliers, we shall fit second-order model (8.2)
robustly, using iteratively reweighted least squares and the Huber weight function (11.44).
We illustrate the calculations for case 1, Alabama. The regression model to be fitted is the
second-order model (11.48). An ordinary least squares fit of this model yields:

$$\hat{Y} = 258.436 + 1.8327 x_2 + 0.06491 x_2^2$$   (11.49)

The residual for Alabama is $e_1 = -2.4109$. The residuals are shown in Column 1 of
Table 11.5. The median of the 40 residuals is median$\{e_i\}$ = 0.7063. Hence, $e_1 - \text{median}\{e_i\} =
-2.4109 - 0.7063 = -3.1172$, and the absolute deviation is $|e_1 - \text{median}\{e_i\}| = 3.1172$.
The median of the 40 absolute deviations is:

$$\text{median}\{\,|e_i - \text{median}\{e_i\}|\,\} = 3.1488$$






TABLE 11.5 Iteratively Huber-Reweighted Least Squares Calculations: Mathematics Proficiency Example.

         Iteration 0             Iteration 1            Iteration 2            Iteration 7
        (1)        (2)         (3)        (4)         (5)        (6)         (7)        (8)
  i     e_i        u_i         w_i        e_i         w_i        e_i         w_i        e_i
  1    -2.4109   -0.51643    1.00000    -3.7542     1.00000    -4.0354     1.00000    -4.1269
  2    10.5724    2.26466    0.59391     8.4297     0.71515     7.4848     0.86011     6.7698
  3     3.0454    0.65234    1.00000     1.5411     1.00000     1.1559     1.00000     0.9731
  4    10.3104    2.20853    0.60900     7.3822     0.81663     5.4138     1.00000     3.6583
 ...
  8   -20.6282   -4.41866    0.30439   -22.2929     0.27042   -22.7964     0.25263   -23.0873
 ...
 11   -14.8358   -3.17791    0.42323   -18.3824     0.32795   -21.4287     0.24019   -24.3167
 ...
 36   -33.6282   -7.20333    0.18672   -35.2929     0.17081   -35.7964     0.16161   -36.0873
 37     2.4659    0.52821    1.00000     1.7722     1.00000     1.7627     1.00000     1.8699
 38    -1.7129   -0.36691    1.00000    -2.7325     1.00000    -2.8490     1.00000    -2.8079
 39     3.2658    0.69954    1.00000     3.2305     1.00000     3.2624     1.00000     3.3014
 40     1.2658    0.27113    1.00000     1.2305     1.00000     1.2624     1.00000     1.3014


so that the MAD estimator (11.46) is:

$$MAD = \frac{3.1488}{.6745} = 4.6683$$

Hence, the scaled residual (11.47) for Alabama is:

$$u_1 = \frac{-2.4109}{4.6683} = -.5164$$


The scaled residuals are shown in Table 11.5, column 2. Since $|u_1| = .5164 < 1.345$, the
initial Huber weight for Alabama is $w_1 = 1.0$. The initial weights are shown in Table 11.5,
column 3. To interpret these weights, remember that ordinary least squares may be viewed
as a special case of weighted least squares with the weights for all cases being equal to 1.
We note in column 3 that the initial weights for cases 8, 11, and 36 (District of Columbia,
Guam, and Virgin Islands) are substantially reduced, and that the weights for some other
states are reduced somewhat.
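
The arithmetic for the initial weights can be checked with a few lines of Python. The sketch below uses only numbers quoted in the text and in Table 11.5 (the median absolute deviation 3.1488, the Alabama residual, and the column 2 scaled residuals for cases 8, 11, and 36); the function and variable names are illustrative.

```python
def huber_weight(u, k=1.345):
    """Huber weight (11.44)."""
    return 1.0 if abs(u) <= k else k / abs(u)

median_abs_dev = 3.1488            # median of the 40 absolute deviations
mad = median_abs_dev / 0.6745      # MAD estimator (11.46)
u1 = -2.4109 / mad                 # scaled residual (11.47) for Alabama
w1 = huber_weight(u1)              # initial Huber weight for Alabama

print(round(mad, 4), round(u1, 4), w1)

# Scaled residuals from column 2 of Table 11.5 for the three outlying cases.
for case, u in ((8, -4.41866), (11, -3.17791), (36, -7.20333)):
    print(case, round(huber_weight(u), 5))
```

The resulting weights for cases 8, 11, and 36 agree with column 3 of Table 11.5 to five decimal places.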

The first iteration of weighted least squares uses the initial weights in column 3, leading
to the fitted regression model:

$$\hat{Y} = 259.390 + 1.6701 x_2 + 0.06463 x_2^2$$   (11.50)

This fitted regression function differs considerably from the ordinary least squares fit
in (11.49). The coefficient of $x_2$ has decreased from $b_2 = 1.8327$ to $b_2 = 1.6701$, while the
curvature term $b_{22} = 0.06463$ changed little from its previous value of $b_{22} = 0.06491$. This
has permitted the estimated regression function to increase for smaller values of $X_2$ and to
therefore conform more closely to the six values that previously fell above the fitted line.

Iteration 2 uses the residuals in column 4 of Table 11.5, scales them, and obtains revised
Huber weights, which are then used in iteration 2 of weighted least squares. The weights
obtained for the eighth iteration differed relatively little from those for the seventh iteration;
hence the iteration process was stopped with the seventh iteration. The final weights are
shown in Table 11.5, column 7. Note that only minor changes in the weights occurred
between iterations 2 and 7. Use of the weights in column 7 leads to the final fitted model:

$$\hat{Y} = 259.421 + 1.5649 x_2 + 0.08016 x_2^2$$   (11.51)

The residuals for the final fit are shown in Table 11.5, column 8. Just as the weights changed
only moderately between iterations 2 and 7, so the residuals changed only to a small extent
after iteration 2. Note that the coefficient of the curvature term did change a bit more
substantially, from $b_{22} = .06463$ to $b_{22} = .08016$.

Figure 11.5e shows the scatter plot and the IRLS fitted second-order regression function,
and Figure 11.5f contains an index plot of the weights used in the final iteration. The robust
fit now tracks the responses to the 37 states extremely well, and the fit to the six cases that
were previously above the regression line is now satisfactory. The plot of the final weights
in Figure 11.5f shows clearly the downweighting of the three outliers.

We conclude from the robust fit in Figure 11.5e that there is a clear upward-curving
relationship between availability of reading resources in the home and average mathematics
proficiency at the state level. This does not necessarily imply a causal relation, of course.
The availability of reading resources may be positively correlated with other variables that
are causally related to mathematics proficiency.

Example 2: Mathematics Proficiency with Five Predictors

We shall explore from a descriptive perspective the relationship between average
mathematics proficiency and the five home environment variables. A MINITAB scatter plot matrix of
the data is presented in Figure 11.6a and the correlation matrix is presented in Figure 11.6b.
The scatter plot matrix also shows the lowess nonparametric regression fits, where q = .9
(the proportion defining a neighborhood) is used in the local fitting.

We see from the first row of the scatter plot matrix that average mathematics proficiency
is related to each of the five explanatory variables and that there are three clear outliers.
They are District of Columbia, Guam, and Virgin Islands, as noted earlier in this section.
The lowess fits show positive relations for PARENTS, HOMELIB, and READING and a
negative relation for ABSENCES. The lowess fit for TVWATCH is distorted because of the
outliers. If these are ignored, the relation is negative. The correlation matrix shows fairly
strong linear association with average mathematics proficiency for all explanatory variables
except ABSENCES, where the degree of linear association is moderate.

The relationships with mathematics proficiency found in Figure 11.6a must be interpreted
with caution. We see from the remainder of the scatter plot matrix and from the correlation
matrix in Figure 11.6b that the explanatory variables are correlated with each other, some
fairly strongly. Also, some of the explanatory variables are correlated with other important
variables not considered in this study. For example, the percentage of students with both
parents at home is related to family income.

For simplicity, we consider only first-order terms in this example. An initial fit of the 
first-order model to the data using ordinary least squares yields the following estimated 
regression function: 


$$\hat{Y} = 155.03 + .3911X_1 + .8639X_2 + .3616X_3 - .8467X_4 + .1923X_5$$   (11.52)



FIGURE 11.6 Scatter Plot Matrix with Lowess Smooths, and Correlation Matrix: Mathematics Proficiency Example.

(a) SYGRAPH Scatter Plot Matrix (variables: MATHPROF, PARENTS, HOMELIB, READING, TVWATCH, ABSENCES)

(b) Correlation Matrix

            MATHPROF   PARENTS   HOMELIB   READING   TVWATCH
PARENTS       0.741
HOMELIB       0.745     0.395
READING       0.717     0.693     0.377
TVWATCH      -0.873    -0.831    -0.594    -0.792
ABSENCES     -0.480    -0.565    -0.443    -0.357     0.512

The signs of the regression coefficients, except for $b_5$, are in the expected directions. The
coefficient of multiple determination for this fitted model is $R^2 = .86$, suggesting that the
explanatory variables are strongly related to average mathematics proficiency.

Table 11.6 presents some diagnostics for the fitted model in (11.52): leverage $h_{ii}$,
studentized deleted residual $t_i$, and Cook’s distance $D_i$. We see that the District of Columbia, Guam,
Texas, and Virgin Islands have leverage values equal to or exceeding 2p/n = 12/40 = .30.







TABLE 11.6 Diagnostics for First-Order Model with All Five Explanatory Variables: Mathematics Proficiency Example.

  i    State             h_ii     t_i      D_i
  1    Alabama           .16     -.05     .00
  2    Arizona           .19      .40      ?
  3    Arkansas          .16     1.41     .06
  4    California        .29      .10     .00
 ...   ...               ...      ...     ...
  8    D.C.              .69     1.41     .72
 ...   ...               ...      ...     ...
 11    Guam              .34    -2.83     .57
 ...   ...               ...      ...     ...
 35    Texas             .30     2.25     .33
 36    Virgin_Islands    .32    -5.21    1.21
 37    Virginia          .06      .90     .01
 38    West_Virginia     .13     -.91     .02
 39    Wisconsin         .08      .39     .00
 40    Wyoming           .08     -.91     .01


We also see that the Virgin Islands is outlying with respect to its Y value; the absolute
value of its studentized deleted residual $t_{36} = -5.21$ exceeds the Bonferroni critical value
at $\alpha = .05$ of $t(1 - \alpha/2n;\, n - p - 1) = t(.99938; 33) = 3.53$. Of these outlying cases,
the Virgin Islands is clearly influential according to Cook’s distance measure, and
District of Columbia and Guam are somewhat influential; the 50th percentile of the F(6, 34)
distribution is .91, and the 25th percentile is .57.

Residual plots against each of the explanatory variables and against $\hat{Y}$ (not shown here)
presented no strong indication of nonconstancy of the error variance for the states aside from
the outliers. Since the explanatory variables are correlated among themselves, the question
arises whether a simpler model can be obtained with almost as much descriptive ability as
the model containing all five explanatory variables. Figure 11.7 presents the MINITAB best
subsets regression output, showing the two models with highest $R^2$ for each number of X
variables. We see that the two best models for three variables (p = 4 parameters) contain
relatively little bias according to the $C_p$ criterion and have $R^2$ values almost as high as the
model with all five variables.

We explore now one of these two models, the one containing HOMELIB, READING, 
and TVWATCH. In view of the outlying and influential cases, we employ IRLS robust 
regression with the Huber weight function (11.44). We find that after eight iterations, the 
weights change very little, so the iteration process is ended with the eighth iteration. The 
final robust fitted regression function is: 

$$\hat{Y} = 207.83 + .7942X_2 + .1637X_3 - 1.1695X_4$$   (11.53)

The signs of the regression coefficients agree with expectations. For comparison, the
regression function fitted by ordinary least squares is:

$$\hat{Y} = 199.61 + .7804X_2 + .4012X_3 - 1.1565X_4$$   (11.54)



448 Part Two Multiple Linear Regression


FIGURE 11.7 MINITAB Best Subsets Regression: Mathematics Proficiency Example.

Best Subsets Regression of MATHPROF

               Adj.
Vars   R-sq    R-sq     C-p        s
   1   76.3    75.7    22.0    6.5079    X
   1   55.5    54.3    72.8    8.9157    X
   2   84.2    83.4     4.6    5.3810    X X
   2   79.2    78.1    16.8    6.1743    X X
   3   85.1    83.9     4.4    5.2939    X X X
   3   85.1    83.8     4.5    5.3062    X X X
   4   85.9    84.3     4.5    5.2327    X X X X
   4   85.4    83.7     5.8    5.3285    X X X X
   5   86.1    84.1     6.0    5.2680    X X X X X


Notice that the robust regression led to a deemphasis of $X_3$ (READING), with the other
regression coefficients remaining almost the same.
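The IRLS procedure used for (11.53) can be sketched in a few lines. The code below is a minimal illustration, not the computation behind the published fit: it handles a single predictor rather than the three used in the example, and it assumes the usual forms of the Huber weight function (tuning constant 1.345) and of the MAD scale estimate (median absolute deviation divided by .6745). The function names `wls_line`, `huber_weight`, and `irls_huber` are hypothetical.

```python
from statistics import median


def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def huber_weight(u, c=1.345):
    """Huber weight: 1 for small scaled residuals, c/|u| beyond the cutoff (assumed c)."""
    return 1.0 if abs(u) <= c else c / abs(u)


def irls_huber(x, y, iterations=8):
    """Iteratively reweighted least squares with Huber weights.

    Starts from ordinary least squares (all weights 1), updates the weights
    `iterations` times, and returns the final fit together with the weights.
    """
    w = [1.0] * len(x)
    for _ in range(iterations):
        b0, b1 = wls_line(x, y, w)
        e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = median(e)
        mad = median(abs(ei - med) for ei in e) / 0.6745  # resistant scale estimate
        if mad == 0:  # residuals have no spread left; keep the current weights
            break
        w = [huber_weight(ei / mad) for ei in e]
    b0, b1 = wls_line(x, y, w)
    return b0, b1, w
```

Cases that end up with small final weights are the ones the robust fit treats as outlying, which is the use described in Comment 2 below.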

To obtain an indication of how well the robust regression model (11.53) describes the
relation between average mathematics proficiency of eighth-grade students and the three
home environment variables, we have ranked the 40 states according to their average
mathematics proficiency score and according to their corresponding fitted value. The Spearman
rank correlation coefficient (2.97) is .945. This indicates a fairly good ability of the three
explanatory variables to distinguish between states whose average mathematics proficiency
is very high or very low.
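The rank comparison just described can be sketched as follows. This is a minimal illustration that assumes the Spearman coefficient is computed as the ordinary correlation coefficient applied to the two sets of ranks, with tied values given their average rank; the function names are hypothetical, and the state data are not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def spearman(observed, fitted):
    """Spearman rank correlation between observed responses and fitted values."""
    return pearson(average_ranks(observed), average_ranks(fitted))
```

Calling `spearman(y, fitted)` with the 40 observed state averages and their robust fitted values would give the kind of summary reported above.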

The analysis of the mathematics proficiency data set in Table 11.4 presented here is by 
no means exhaustive. We have not analyzed higher-order effects, nor have we explored 
other subsets that might be reasonable to use. We have not recognized that the precision of 
the state data varies because the data are based on samples of different sizes, nor have we 
considered other explanatory variables that are related to mathematics proficiency, such as 
parents’ education and family income. Furthermore, we have analyzed state averages, which 
may obscure important insights into relations between the variables at the family level.

Comments 

1. Robust regression requires knowledge of the regression function. When the appropriate
regression function is not clear, nonparametric regression may be useful. Nonparametric regression is
discussed in Section 11.4.

2. Robust regression can be employed to identify outliers in situations where there are multiple
outliers whose presence is masked with diagnostic measures that delete one case at a time. Cases
whose final weights are relatively small are outlying.

3. As illustrated by the mathematics proficiency example, robust regression is often useful for
confirming the reasonableness of ordinary least squares results. When robust regression yields similar
results to ordinary least squares (for example, the residuals are similar), one obtains some reassurance
that ordinary least squares is not unduly influenced by outlying cases.


[Figure 11.7 predictor column labels: PARENTS, HOMELIB, READING, TVWATCH, ABSENCES]



Chapter 11 Building the Regression Model III: Remedial Measures 449


4. A limitation of robust regression is that the evaluation of the precision of the estimated regression
coefficients is more complex than for ordinary least squares. Some large-sample results have been
obtained (see, for example, Reference 11.5), but they may not perform well in the presence of outliers.
Bootstrapping (to be discussed in Section 11.5) may also be used for evaluating the precision of robust
regression results.

5. When the Huber, bisquare, and other weight functions are based on the scaled residuals
in (11.47), they primarily reduce the influence of cases that are outlying with respect to their Y
values. To make the robust regression fit more sensitive to cases that are outlying with respect to their
X values, studentized residuals in (10.20) or studentized deleted residuals in (10.24) may be used
instead of the scaled residuals in (11.47). Again, $\sqrt{MSE}$ may be replaced by MAD in (11.46) for better
resistance and robustness when calculating the studentized or studentized deleted residuals.

In addition, the weights $w_i$ obtained from the weight function may be modified to reduce directly
the influence of cases with large X leverage. One suggestion is to multiply the weight function weight
$w_i$ by $\sqrt{1 - h_{ii}}$, where $h_{ii}$ is the leverage value of the ith case defined in (10.18).

Methods that reduce the influence of cases that are outlying with respect to their X values are
called bounded influence regression methods.

11.4 Nonparametric Regression: Lowess Method and Regression Trees

We considered nonparametric regression in Chapter 3 when there is one predictor variable
in the regression model. We noted there that nonparametric regression fits are useful for
exploring the nature of the response function, to confirm the nature of a particular response
function that has been fitted to the data, and to obtain estimates of mean responses without
specifying the nature of the response function.

Nonparametric regression can be extended to multiple regression when there are two or 
more predictor variables. Additional complexities are encountered, however, when making 
this extension. With more than two predictor variables, it is not possible to show the fitted 
response surface graphically, so one cannot see its appearance. Unlike parametric regression, 
no analytic expression for the response surface is provided by nonparametric regression. 
Also, as the number of predictor variables increases, there may be fewer and fewer cases in
a neighborhood, leading to erratic smoothing. This latter problem is less serious when the
predictor variables are highly correlated and interest in the response surface is confined to
the region of the X observations.

Numerous procedures have been developed for fitting a response surface when there 
are two or more predictor variables without specifying the nature of the response function. 
Reference 11.9 discusses a number of these procedures. These include locally weighted 
regressions (Ref. 11.10), regression trees (Ref. 11.11), projection pursuit (Ref. 11.12), and 
smoothing splines (Ref. 11.13). We discuss the lowess method and regression trees in this
section. We first extend the lowess method to multiple regression. In doing so, we will be 
able to describe it in far greater detail because we have established the necessary foundation 
of weighted least squares in Section 11.1. 

Lowess Method

We described the lowess method briefly in Chapter 3 for regression with one predictor 
variable. The lowess method for multiple regression, developed by Cleveland and Devlin 





(Ref. 11.10), assumes that the predictor variables have already been selected, that the
response function is smooth, and that appropriate transformations have been made or other
remedial steps taken so that the error terms are approximately normally distributed with
constant variance. For any combination of X levels, the lowess method fits either a first-order
model or a second-order model based on cases in the neighborhood, with more
distant cases in the neighborhood receiving smaller weights. We shall explain the lowess
method for the case of two predictor variables when we wish to obtain the fitted value
at $(X_{h1}, X_{h2})$.

Distance Measure. We need a distance measure showing how far each case is from
$(X_{h1}, X_{h2})$. Usually, a Euclidean distance measure is employed. For the ith case, this
measure is denoted by $d_i$ and is defined:

$$d_i = [(X_{i1} - X_{h1})^2 + (X_{i2} - X_{h2})^2]^{1/2}$$  (11.55)

When the predictor variables are measured on different scales, each should be scaled by
dividing it by its standard deviation. The median absolute deviation estimator in (11.46)
can be used in place of the standard deviation if outliers are present.

Weight Function. The neighborhood about the point $(X_{h1}, X_{h2})$ is defined in terms of
the proportion q of cases that are nearest to the point. Let $d_q$ denote the Euclidean distance
of the furthest case in the neighborhood. The weight function used in the lowess method is
the tricube weight function, which is defined as follows:

$$w_i = \begin{cases} [1 - (d_i/d_q)^3]^3 & d_i < d_q \\ 0 & d_i \ge d_q \end{cases}$$  (11.56)

Thus, cases outside the neighborhood receive weight zero and cases within the neighborhood
receive weights between 0 and 1, the weight decreasing with greater distance. In this way,
the mean response at $(X_{h1}, X_{h2})$ is estimated locally.

The choice of the proportion q defining the neighborhood requires a balancing of two
opposing tendencies. The larger q is, the smoother the fit will be, but at the same time the
greater may be the bias in the fitted value. A choice of q between .4 and .6 may often be
appropriate.

Local Fitting. Given the weights for the n cases based on (11.55) and (11.56), weighted
least squares is then used to fit either the first-order model (6.1) or the second-order
model (6.16). The second-order model is helpful when the response surface has substantial
curvature; moderate curvilinearities can be detected by using the first-order model. After
the regression model is fitted by weighted least squares, the fitted value $\hat{Y}_h$ at $(X_{h1}, X_{h2})$ then
serves as the nonparametric estimate of the mean response at these X levels. By recalculating
the weights for different $(X_{h1}, X_{h2})$ levels, fitting the response function repeatedly, and
each time obtaining the fitted value $\hat{Y}_h$, we obtain information about the response surface
without making any assumptions about the nature of the response function.
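The three steps (scaled distance, tricube weights, local weighted least squares) can be put together in a short sketch. The code below is a minimal illustration for two predictors and a first-order local model. It assumes that the neighborhood distance $d_q$ is the distance of the ceil(qn)-th nearest case, and it solves the weighted normal equations with a small Gaussian elimination routine. The names `tricube`, `solve`, and `lowess_fit` are hypothetical.

```python
import math


def tricube(d, dq):
    """Tricube weight: positive inside the neighborhood, zero at or beyond dq."""
    return (1 - (d / dq) ** 3) ** 3 if d < dq else 0.0


def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def lowess_fit(x1, x2, y, h1, h2, s1, s2, q=0.5):
    """Locally fitted value at (h1, h2) from a first-order weighted fit.

    s1 and s2 are the scale values (for example, sample standard deviations)
    used to standardize the two predictors before computing distances.
    """
    n = len(y)
    d = [math.hypot((a - h1) / s1, (b - h2) / s2) for a, b in zip(x1, x2)]
    dq = sorted(d)[math.ceil(q * n) - 1]  # distance of the furthest neighborhood case
    w = [tricube(di, dq) for di in d]
    rows = [(1.0, a, b) for a, b in zip(x1, x2)]
    xtwx = [[sum(wi * r[i] * r[j] for wi, r in zip(w, rows)) for j in range(3)]
            for i in range(3)]
    xtwy = [sum(wi * r[i] * yi for wi, r, yi in zip(w, rows, y)) for i in range(3)]
    b0, b1, b2 = solve(xtwx, xtwy)
    return b0 + b1 * h1 + b2 * h2
```

Repeating the call over a grid of (h1, h2) values traces out the fitted response surface. At least three cases with positive weight that do not lie on a line in the X plane are needed for the local first-order fit to be determined.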

We shall fit a nonparametric regression function for the life insurance example in Chapter 10.
A portion of the data for a second group of 18 managers is given in Table 11.7, columns 1-3.
The relation between amount of life insurance carried (Y) and income ($X_1$) and risk
aversion ($X_2$) is to be investigated, the data pertaining to managers in the 30-39 age group.







TABLE 11.7 Lowess Calculations for Nonparametric Regression Fit at $X_{h1} = 30$, $X_{h2} = 3$: Life Insurance Example.

   (1)        (4)       (5)
  X_i1        d_i       w_i
 66.290      3.013      0
 40.964      1.143      .300
 72.996      4.212      0
 79.380      3.461      0
 52.766      2.663      0
 55.916      2.188      0


The local fitting will be done using the first-order model in (6.1) because the number of
available cases is not too large. For the same reason, the proportion of cases defining the
local neighborhoods is set at q = .5; in other words, each local neighborhood is to consist
of half of the cases.

The exploration of the response surface begins at $X_{h1} = 30$, $X_{h2} = 3$. To obtain a locally
fitted value at $X_{h1} = 30$, $X_{h2} = 3$, we need to obtain the Euclidean distances of each case
from this point. We shall use the sample standard deviations of the two predictor variables to
standardize the variables in obtaining the Euclidean distance since the two variables are
measured on different scales. The sample standard deviations are $s_1 = 14.739$ and $s_2 = 2.3044$.
For case 1, the Euclidean distance from $X_{h1} = 30$, $X_{h2} = 3$ is obtained as follows:

$$d_1 = \left[\left(\frac{66.290 - 30}{14.739}\right)^2 + \left(\frac{7 - 3}{2.3044}\right)^2\right]^{1/2} = 3.013$$

The Euclidean distances are shown in Table 11.7, column 4. The Euclidean distance of the
furthest case in the neighborhood of $X_{h1} = 30$, $X_{h2} = 3$ for q = .5 is for the ninth case when
these are ordered according to their Euclidean distance. It is $d_q = 1.653$. Since $d_1 = 3.013 >
1.653$, the weight assigned for case 1 is $w_1 = 0$. For case 2, the Euclidean distance is
$d_2 = 1.143$. Since this is less than 1.653, the weight for case 2 is:

$$w_2 = [1 - (1.143/1.653)^3]^3 = .300$$

The weights are shown in Table 11.7, column 5.
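The two hand calculations for cases 1 and 2 can be checked directly. The snippet below is a small arithmetic check using only the numbers quoted in the text; the variable names are illustrative.

```python
import math

s1, s2 = 14.739, 2.3044   # sample standard deviations of income and risk aversion
h1, h2 = 30, 3            # point at which the fitted value is wanted
d_q = 1.653               # distance of the furthest case in the neighborhood

# Case 1: scaled Euclidean distance from (30, 3), following (11.55)
d1 = math.hypot((66.290 - h1) / s1, (7 - h2) / s2)

# Case 2: tricube weight from its distance d2 = 1.143, following (11.56)
d2 = 1.143
w2 = (1 - (d2 / d_q) ** 3) ** 3

print(f"d1 = {d1:.3f}, w2 = {w2:.3f}")
```

Both results agree with the values 3.013 and .300 in Table 11.7 to three decimal places.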

The fitted first-order regression function using these weights is:

$$\hat{Y} = -134.076 + 3.571X_1 + 10.532X_2$$

The fitted value for $X_{h1} = 30$, $X_{h2} = 3$ therefore is:

$$\hat{Y}_h = -134.076 + 3.571(30) + 10.532(3) = 4.65$$

In the same fashion, locally fitted values at other values of $X_{h1}$ and $X_{h2}$ are calculated.
Figure 11.8a contains a contour plot of the fitted response surface. The surface clearly
ascends as $X_1$ increases, but the effect of $X_2$ is more difficult to see from the contour plot.
The effect of $X_2$ can be seen more easily by the conditional effects plots of Y against
$X_1$ at low, middle, and high levels of $X_2$ in Figure 11.8b. The conditional effects plots in
Figure 11.8b are also called two-variable conditioning plots. Note that the expected amount
of life insurance carried increases with income ($X_1$) at all levels of risk aversion ($X_2$). The







FIGURE 11.8 Contour and Conditioning Plots for Lowess Nonparametric Regression — Life Insurance 
Example. 

(a) Contour Plot 



(b) Two-Variable Conditioning Plots 





response functions for $X_2 = 3$ and $X_2 = 6$ appear to be approximately linear. The dip in the
left part of the response function for $X_2 = 9$ may be the result of an interaction or of noisy
data and inadequate smoothing. Note also from Figure 11.8b that the expected amount of
life insurance carried at the higher income levels increases as the risk aversion becomes
very high.


Comments 

1. The fitted nonparametric response surface can be used, just as for simple regression, for
examining the appropriateness of a fitted parametric regression model. If the fitted nonparametric response
surface falls within the confidence band in (6.60) for the parametric regression function, the
nonparametric fit supports the appropriateness of the parametric regression function.

2. Reference 11.10 discusses a procedure to assist in choosing the proportion q for defining a
local neighborhood. It also describes how the precision of any fitted value $\hat{Y}_h$ obtained with lowess
nonparametric multiple regression can be approximated.





3. The assumptions of normality and constant variance of the error terms required by the lowess
nonparametric procedure can be checked in the usual fashion. The residuals are obtained by fitting
the lowess nonparametric regression function for each case and calculating $e_i = Y_i - \hat{Y}_i$ as usual.
These residuals will not have the least squares property of summing to zero, but can be examined for
normality and constancy of variance. The residuals can also serve to identify outliers that might not
be disclosed by standard diagnostic procedures.

4. A discussion of some of the advantages of the lowess smoothing procedure is presented in
Reference 11.14.

Regression Trees 

Regression trees are a very powerful, yet conceptually simple, method of nonparametric
regression. For the case of a single predictor, the range of the predictor is partitioned into
segments and within each segment the estimated regression fit is given by the mean of
the responses in the segment. For two or more predictors, the X space is partitioned into
rectangular regions, and again, the estimated regression surface is given by the mean of
the responses in each rectangle. Regression trees have become a popular alternative to
multiple regression for exploratory studies, especially for extremely large data sets. Along
with neural networks (see Chapter 13), regression trees are one of the standard methods
used in the emerging field of data mining. Regression trees are easy to calculate, require
virtually no assumptions, and are simple to interpret.

One Predictor Tree: Steroid Level Example. Figure 1.3 on page 5 presents data on age
and level of a steroid in plasma for 27 healthy females between 8 and 25 years of age.
The data are shown in the first two columns of Table 11.8. A regression tree based on five
regions is obtained by partitioning the range of X (age) into five segments or regions, and
using the sample average of the Y responses in each region for the fitted regression surface.
We will use $R_{51}$ through $R_{55}$ to denote the regions of a 5-region tree, and $\bar{Y}_{R_{51}}$ through $\bar{Y}_{R_{55}}$
to denote the corresponding sample averages. These values are shown for the steroid level
example in columns 4-6 of Table 11.8. The fitted regression tree is shown in Figure 11.9a.
Note that the regression tree is a step function that steps up rapidly for girls between the
ages of 8 and 14, after which point steroid level is roughly constant.

A plot of residuals versus fitted values is shown in Figure 11.9b. Note that the variance
of the residuals in each region seems roughly constant, an indication that further splitting
may be unnecessary. We discuss the determination of appropriate tree size below.


TABLE 11.8 Data Set and 5-Region Regression Tree Fit: Steroid Level Example.

 (1)       (2)          (3)        (4)          (5)             (6)
 Case    Steroid        Age       Region       Region          Fitted
   i     Level Y_i      X_i      Number k       R_5k            Value
   1       27.1          23         1         8 ≤ X < 9         3.550
   2       22.1          19         2         9 ≤ X < 10        8.133
   3       21.9          25         3        10 ≤ X < 13       13.675
  ...      ...          ...         4        13 ≤ X < 14       16.950
  25       12.8          13         5        14 ≤ X ≤ 25       22.200
  26       20.8          14
  27       20.6          18






FIGURE 11.9 Fitted Regression Tree, Residual Plot, and Regression Tree Diagram: Steroid Level Example.



Determining the predicted value for a given $X_h$ is accomplished with the help of a
tree diagram, such as the one shown in Figure 11.9c. Suppose we wish to determine the
predicted value at $X_h = 12.5$. Starting at node 1 (the root node), we ask, “Is Age < 13?”
Since 12.5 < 13, we follow the left branch to node 2 where we ask, “Is Age < 10?” Since
Age is not less than 10, we branch right to the terminal node labeled Leaf 3, where we find
from Table 11.8 that $\bar{Y}_{R_{53}} = 13.675$. Tree diagrams such as that shown in Figure 11.9c are
particularly helpful when more than a single predictor is present.
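The walk down the tree can be written as a short loop. The sketch below encodes the 5-region steroid tree as nested tuples; the split points and leaf values come from Table 11.8, while the arrangement of the lower nodes is an assumption based on the order in which the splits are described (13 first, then 10, then 14 and 9). The names `TREE` and `predict` are hypothetical.

```python
# Internal node: (split_point, left_subtree, right_subtree), where the left
# branch is taken when Age < split_point. A leaf is the fitted region mean.
TREE = (13,
        (10,
         (9, 3.550, 8.133),    # Leaf 1: 8 <= Age < 9,   Leaf 2: 9 <= Age < 10
         13.675),              # Leaf 3: 10 <= Age < 13
        (14, 16.950, 22.200))  # Leaf 4: 13 <= Age < 14, Leaf 5: 14 <= Age <= 25


def predict(node, age):
    """Follow the yes/no questions from the root until a leaf value is reached."""
    while isinstance(node, tuple):
        split, left, right = node
        node = left if age < split else right
    return node


print(predict(TREE, 12.5))
```

For Age = 12.5 the loop answers "yes" to Age < 13 and "no" to Age < 10, ending at Leaf 3 as in the text.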

Growing a Regression Tree. To find a “best” regression tree, it is necessary to specify the 
number of regions, r, and the boundaries, or split points, between the regions. The process 
of determining a best value for r and the associated split points is referred to as growing 
the tree. 

First consider the case of a single predictor, and assume that the range of X is to be
divided into r = 2 regions, $R_{21}$ and $R_{22}$. We need to find the split point $X_s$ that optimally
divides the data into two sets. The best point is chosen to minimize the error sum of squares




for the resulting regression tree:

$$SSE = SSE(R_{21}) + SSE(R_{22})$$

where $SSE(R_{rj})$ is the sum of squared residuals in region $R_{rj}$:

$$SSE(R_{rj}) = \sum_{X_i \in R_{rj}} (Y_i - \bar{Y}_{R_{rj}})^2$$

For the steroid level data, the best split point is shown in Figure 11.10a to be $X_s = 13.0$.
For this tree, we have:

$$R_{21} = \{X \mid X < 13\}$$

$$R_{22} = \{X \mid X \ge 13\}$$

for which we obtain:

$$SSE = SSE(R_{21}) + SSE(R_{22}) = 238.55 + 167.79 = 406.35$$

From (2.72), the coefficient of determination for the regression tree is:

$$R^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{406.35}{1284.8} = .684$$

Also, $MSE = SSE/(n - r) = 406.35/(27 - 2) = 16.254$.






At this point, there are two regions, and growing the tree further will require the
identification of a third region. We have two choices: (1) we can work sequentially and split one of
the two existing regions, or (2) start from scratch and identify simultaneously two entirely
new split points that globally minimize the resulting SSE criterion. The second approach
will always lead to a criterion value that is at least as good as the first; however, as the tree
grows, so do the computational demands associated with this approach (particularly if there is
more than one predictor). For this reason, regression trees are generally grown sequentially
according to the following rule: If the tree currently is based on r regions, we determine the
best split point for each of the regions, and then split the region that leads to the greatest
decrease in SSE.
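The sequential rule can be sketched for a single predictor as follows. This is a minimal illustration, not a production tree algorithm: it assumes that candidate split points are the observed X values, with a split at s producing the regions X < s and X ≥ s (matching the form of $R_{21}$ and $R_{22}$ above). The names `sse`, `best_split`, and `grow_tree` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of the responses from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(points):
    """Best single split of a region given as a list of (x, y) pairs.

    Returns (sse_after_split, split_point), or None if all x values are equal.
    """
    xs = sorted({x for x, _ in points})
    best = None
    for s in xs[1:]:  # skip the smallest x so that both sides are nonempty
        left = [y for x, y in points if x < s]
        right = [y for x, y in points if x >= s]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, s)
    return best


def grow_tree(x, y, r):
    """Grow a one-predictor tree to r regions; returns the sorted split points."""
    regions = [list(zip(x, y))]
    splits = []
    while len(regions) < r:
        best = None  # (decrease in SSE, region index, split point)
        for k, points in enumerate(regions):
            candidate = best_split(points)
            if candidate is None:
                continue
            decrease = sse([yy for _, yy in points]) - candidate[0]
            if best is None or decrease > best[0]:
                best = (decrease, k, candidate[1])
        if best is None:  # no region can be split further
            break
        _, k, s = best
        points = regions.pop(k)
        regions.insert(k, [p for p in points if p[0] >= s])
        regions.insert(k, [p for p in points if p[0] < s])
        splits.append(s)
    return sorted(splits)
```

Because each step only compares splits within the existing regions, the work grows roughly with the number of cases per step, which is the computational advantage over searching for all split points simultaneously.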

For the steroid-level example, the next step involves splitting $R_{21}$ at $X_s = 10$, resulting
in three regions:

$$R_{31} = \{X \mid X < 10\}$$

$$R_{32} = \{X \mid 10 \le X < 13\}$$

$$R_{33} = \{X \mid X \ge 13\}$$

A plot of this tree is shown in Figure 11.10b. Continuing this process, we next split $R_{33}$ at
$X_s = 14$, and a final split occurs at $X_s = 9$. The 4-region and 5-region regression trees are
shown in Figures 11.10c and 11.10d.

For two or more predictors, the procedure is the same, except that in addition to deter¬ 
mining the best region and split point, we must also determine the best predictor upon which 
to base the split. The rule is as follows: assuming the tree is based currently on r rectangular 
regions, we determine the best split point for each of the r regions for each of the p - 1 
predictors, and then implement a new split based on the region and predictor that leads to 
the largest decrease in SSE. Note that we are choosing the best predictor-and-split-point 
combination from r(p — 1) possibilities. 

This process is illustrated for two predictors in Figure 11.11. We first consider splitting
the rectangular X space either on the basis of $X_1$ or $X_2$. We find the best split points $X_{1s}$
and $X_{2s}$ for $X_1$ and $X_2$ respectively, and then we base our next partition on the split point
that leads to the greatest decrease in SSE. According to Figure 11.11a, the first split is based
on $X_1$, resulting in two rectangular regions $R_{21}$ and $R_{22}$. For each of these two regions, we
determine the best predictor upon which to split and the associated split point, and choose
the combination that leads to the largest decrease in SSE. Figure 11.11b indicates that region
$R_{22}$ was partitioned in this step on the basis of $X_2$. Finally, in the third split, region $R_{31}$ is
partitioned on the basis of $X_1$, resulting in a 4-region tree, as shown in Figure 11.11c.

FIGURE 11.11 Regression Tree Growth: Two-Predictor Example.
(a) Branch 1, to 2 Regions: best split based on $X_1$
(b) Branch 2, to 3 Regions: best split based on $X_2$ in $R_{22}$
(c) Branch 3, to 4 Regions: best split based on $X_1$ in $R_{31}$







Determining the Number of Regions, r. If the tree-growing process is allowed to continue
indefinitely, there will eventually be n regions, with each region containing a single
observation, and further partitioning will be impossible. A “best” number of regions will
generally fall between 1 and n, and is usually chosen through validation studies. For example,
for each split we determine, in addition to SSE, the mean square for prediction error
MSPR for data in a hold-out or validation sample. We then choose the tree that minimizes
MSPR.



We illustrate the use of regression trees with the University admissions data set in
Appendix C.4. We fit GPA at the end of freshman year (Y) as a function of ACT entrance test
score ($X_1$) and high school rank ($X_2$). The data consist of 705 cases, and a random sample
of $n^* = 353$ records was selected for the validation set. Figure 11.12a provides a plot of
MSPR versus the number of regions, or terminal nodes. The plot shows that the ability to
predict improves as nodes are added until r = 5, for which MSPR = .318 (MSE for this







model is .322). For r > 5, the ability to predict responses in the validation set deteriorates
as the number of regions increases. A plot of MSE is also included, and as expected MSE
decreases monotonically with the size of the tree. The fitted regression tree surface is shown
in Figure 11.12b and the corresponding tree diagram is shown in Figure 11.12c.

A plot of residuals versus predicted values is shown for this tree in Figure 11.12d. Note
that the variance of the residuals appears to be somewhat constant, an indication that
further partitions may not be required.


It is instructive to compare qualitatively the fit of the regression tree to the fit obtained
using standard regression methods. Using a full second-order model leads to the equation:

$$\hat{Y} = 1.77 - .0223X_1 + .0780X_2 + .000187X_1^2 - .00133X_2^2 + .000342X_1X_2$$

MSPR for the second-order regression model is .296, which is slightly better than the value
obtained by the regression tree (.318). Interestingly, the MSE value obtained by the
second-order regression model (.333) is about the same as that obtained by the regression tree
(.322).


In summary, the regression tree surface suggests as expected that college GPA increases
with both ACT score and high school rank. Overall, high school rank seems to have a 
slightly more pronounced effect than ACT score. For this tree, R 2 is .256 for the training 
data set, and .157 for the validation data set. We conclude that GPA following freshman 
year is related to high school rank and ACT score, but the fraction of variation in GPA 
explained by these predictors is quite small. 


Comments 

1. The number of regions r is sometimes chosen by minimizing the cost complexity criterion:

$$C_\lambda(r) = \sum_{k=1}^{r} SSE(R_{rk}) + \lambda r$$

The cost complexity criterion has two components: the sum of squared residuals plus a penalty, $\lambda r$,
for the number of regions r employed. The tuning parameter $\lambda > 0$ determines the balance between
the size of the tree (complexity) and the goodness of fit. Larger values of $\lambda$ lead to smaller trees.
Note that this criterion is a form of penalized least squares, which, as we commented in Section 11.2,
can be used to obtain ridge regression estimates. Penalized least squares is also used in connection
with neural networks as described in Section 13.6. A “best” value for $\lambda$ is generally chosen through
validation studies.

2. Regression trees are often used when the response Y is qualitative. In such cases, predicting
a response at $X_h$ is equivalent to determining to which response category $X_h$ belongs. This is a
classification problem, and the resulting tree is referred to as a classification tree. Details are provided
in References 11.11 and 11.15.
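The cost complexity criterion of Comment 1 is simple to evaluate once the SSE of each candidate tree is known. The sketch below is illustrative only. The SSE values for r = 1 and r = 2 are the steroid level figures quoted earlier (SSTO = 1284.8 and SSE = 406.35); the values for r = 3, 4, and 5 are made up for the illustration and are not from the text. The function names are hypothetical.

```python
def cost_complexity(region_sse, lam):
    """C_lambda(r): sum of the regional SSE values plus the penalty lam * r."""
    return sum(region_sse) + lam * len(region_sse)


def best_size(sse_by_size, lam):
    """Tree size r that minimizes total SSE + lam * r.

    sse_by_size maps the number of regions r to the total SSE of that tree.
    """
    return min(sse_by_size, key=lambda r: sse_by_size[r] + lam * r)


# r = 1 and r = 2 are from the steroid level example; r = 3..5 are
# hypothetical values used only to show how the penalty shifts the choice.
sse_by_size = {1: 1284.8, 2: 406.35, 3: 300.0, 4: 250.0, 5: 240.0}

for lam in (0, 20, 200):
    print(lam, best_size(sse_by_size, lam))
```

With no penalty the largest tree is chosen, and the selected size shrinks as the penalty grows, which is the behavior described in the comment.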


11.5 Remedial Measures for Evaluating Precision
in Nonstandard Situations: Bootstrapping

For standard fitted regression models, methods described in earlier chapters are available for
evaluating the precision of estimated regression coefficients, fitted values, and predictions of 
new observations. However, in many nonstandard situations, such as when nonconstant error 





variances are estimated by iteratively reweighted least squares or when robust regression 
estimation is used, standard methods for evaluating the precision may not be available or 
may only be approximately applicable when the sample size is large. Bootstrapping was 
developed by Efron (Ref. 11.16) to provide estimates of the precision of sample estimates 
for these complex cases. A number of bootstrap methods have now been developed. The
bootstrap method that we shall explain is simple in principle and nonparametric in nature.
Like all bootstrap methods, it requires extensive computer calculations.

General Procedure 

We shall explain the bootstrap method in terms of evaluating the precision of an estimated
regression coefficient. The explanation applies identically to any other estimate, such as a
fitted value. Suppose that we have fitted a regression model (simple or multiple) by some
procedure and obtained the estimated regression coefficient $b_1$; we now wish to evaluate the
precision of this estimate by the bootstrap method. In essence, the bootstrap method calls for
the selection from the observed sample data of a random sample of size n with replacement.
Sampling with replacement implies that the bootstrap sample may contain some duplicate
data from the original sample and omit some other data in the original sample. Next, the
bootstrap method calculates the estimated regression coefficient from the bootstrap sample,
using the same fitting procedure as employed for the original fitting. This leads to the first
bootstrap estimate $b_1^*$. This process is repeated a large number of times; each time a bootstrap
sample of size n is selected with replacement from the original sample and the estimated
regression coefficient is obtained for the bootstrap sample. The estimated standard deviation
of all of the bootstrap estimates $b_1^*$, denoted by $s^*\{b_1^*\}$, is an estimate of the variability of
the sampling distribution of $b_1$ and therefore is a measure of the precision of $b_1$.

Bootstrap Sampling 

Bootstrap sampling for regression can be done in two basic ways. When the regression
function being fitted is a good model for the data, the error terms have constant variance,
and the predictor variable(s) can be regarded as fixed, fixed X sampling is appropriate. Here
the residuals $e_i$ from the original fitting are regarded as the sample data to be sampled with
replacement. After a bootstrap sample of the residuals of size n has been obtained, denoted
by $e_1^*, \ldots, e_n^*$, the bootstrap sample residuals are added to the fitted values from the original
fitting to obtain new bootstrap Y values, denoted by $Y_1^*, \ldots, Y_n^*$:

$$Y_i^* = \hat{Y}_i + e_i^*$$  (11.57)

These bootstrap $Y^*$ values are then regressed on the original X variable(s) by the same
procedure used initially to obtain the bootstrap estimate $b_1^*$.

When there is some doubt about the adequacy of the regression function being fitted, the
error variances are not constant, and/or the predictor variables cannot be regarded as fixed,
random X sampling is appropriate. For simple regression, the pairs of X and Y data in the
original sample are considered to be the data to be sampled with replacement. Thus, this
second procedure samples cases with replacement n times, yielding a bootstrap sample of
n pairs of $(X^*, Y^*)$ values. This bootstrap sample is then used for obtaining the bootstrap
estimate $b_1^*$, as with fixed X sampling.
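Both sampling schemes can be sketched for simple linear regression fitted by ordinary least squares. The code below is a minimal illustration, assuming OLS as the fitting procedure (in practice the same procedure used for the original fit, such as IRLS, would be substituted). The names `ols` and `bootstrap_slopes`, the seed argument, and the handling of degenerate resamples are choices made for this sketch.

```python
import random
from statistics import stdev


def ols(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    return ybar - b1 * xbar, b1


def bootstrap_slopes(x, y, n_boot=500, fixed_x=True, seed=0):
    """Bootstrap estimates of the slope under fixed X or random X sampling."""
    rng = random.Random(seed)
    n = len(x)
    b0, b1 = ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        if fixed_x:
            # Fixed X sampling: resample residuals and add them to the fitted values.
            y_star = [fi + rng.choice(resid) for fi in fitted]
            slopes.append(ols(x, y_star)[1])
        else:
            # Random X sampling: resample whole (X, Y) cases with replacement.
            idx = [rng.randrange(n) for _ in range(n)]
            x_star = [x[i] for i in idx]
            if len(set(x_star)) < 2:  # slope undefined for this resample; skip it
                continue
            slopes.append(ols(x_star, [y[i] for i in idx])[1])
    return slopes
```

The bootstrap measure of precision is then the standard deviation of the returned slopes, for example `stdev(bootstrap_slopes(x, y))`.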

The number of bootstrap samples to be selected for evaluating the precision of an 
estimate depends on the special circumstances of each application. Sometimes, as few 





as 50 bootstrap samples are sufficient. Often, 200-500 bootstrap samples are adequate. One
can observe the variability of the bootstrap estimates by calculating $s^*\{b_1^*\}$ as the number
of bootstrap samples is increased. When $s^*\{b_1^*\}$ stabilizes fairly reasonably, bootstrapping
can be terminated.


Bootstrap Confidence Intervals 

Bootstrapping can also be used to arrive at approximate confidence intervals. Much research
is ongoing on different procedures for obtaining bootstrap confidence intervals (see, for
example, References 11.17 and 11.18). A relatively simple procedure for setting up a $1 - \alpha$
confidence interval is the reflection method. This procedure often produces a reasonable
approximation, but not always. The reflection method confidence interval for $\beta_1$ is based
on the $(\alpha/2)100$ and $(1 - \alpha/2)100$ percentiles of the bootstrap distribution of $b_1^*$. These
percentiles are denoted by $b_1^*(\alpha/2)$ and $b_1^*(1 - \alpha/2)$, respectively. The distances of these
percentiles from $b_1$, the estimate of $\beta_1$ from the original sample, are denoted by $d_1$ and $d_2$:

$$d_1 = b_1 - b_1^*(\alpha/2)$$  (11.58a)

$$d_2 = b_1^*(1 - \alpha/2) - b_1$$  (11.58b)

The approximate $1 - \alpha$ confidence interval for $\beta_1$ then is:

$$b_1 - d_2 \le \beta_1 \le b_1 + d_1$$  (11.59)

Bootstrap confidence intervals by the reflection method require a larger number of
bootstrap samples than do bootstrap estimates of precision because tail percentiles are required.
About 500 bootstrap samples may be a reasonable minimum number for reflection bootstrap
confidence intervals.
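Formulas (11.58a), (11.58b), and (11.59) translate directly into code. The sketch below is a minimal illustration; the nearest-rank percentile rule is an assumption, since several percentile conventions are in use, and the function names are hypothetical. The usage lines apply the formulas to the Toluca Company values reported in the example that follows ($b_1 = 3.5702$, with bootstrap percentiles 2.940 and 4.211).

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list (one of several conventions)."""
    k = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[k - 1]


def reflection_interval(b1, lower_pct, upper_pct):
    """Reflection method limits from the two bootstrap percentiles."""
    d1 = b1 - lower_pct      # (11.58a)
    d2 = upper_pct - b1      # (11.58b)
    return b1 - d2, b1 + d1  # (11.59)


def reflection_ci(b1, boot_estimates, alpha=0.05):
    """Reflection method confidence interval from a list of bootstrap estimates."""
    s = sorted(boot_estimates)
    return reflection_interval(b1, percentile(s, alpha / 2),
                               percentile(s, 1 - alpha / 2))


lo, hi = reflection_interval(3.5702, 2.940, 4.211)
print(f"{lo:.3f} <= beta_1 <= {hi:.3f}")
```

Note that the upper distance $d_2$ is subtracted and the lower distance $d_1$ is added, so the interval is the percentile interval reflected about $b_1$. For the Toluca values this gives limits of about 2.929 and 4.200.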


Examples 


We illustrate the bootstrap method by two examples. In the first one, standard analytical
methods are available and bootstrapping is used simply to show that it produces similar 
results. In the second example, the estimation procedure is complex, and bootstrapping 
provides a means for assessing the precision of the estimate. 


Example 1: Toluca Company


We use the Toluca Company example of Table 1.1 to illustrate how the bootstrap method
approximates standard analytical results. We found in Chapter 2 that the estimate of the
slope $\beta_1$ is $b_1 = 3.5702$, that the estimated precision of this estimate is $s\{b_1\} = .3470$, and
that the 95 percent confidence interval for $\beta_1$ is $2.85 \le \beta_1 \le 4.29$.

To evaluate the precision of the estimate $b_1 = 3.5702$ by the bootstrap method, we shall
use fixed X sampling. Here, the simple linear regression function fits the data well, the error
variance appears to be constant, and it is reasonable to consider a repetition of the study
with the same lot sizes. A portion of the data on lot size (X) and work hours (Y) is repeated
in Table 11.9, columns 1 and 2. The fitted values and residuals obtained from the original
sample are repeated from Table 1.2 in columns 3 and 4. Column 5 of Table 11.9 shows the
first bootstrap sample of n residuals $e_i^*$, selected from column 4 with replacement. Finally,
column 6 shows the first bootstrap sample $Y_i^*$ observations. For example, by (11.57), we
obtain $Y_1^* = \hat{Y}_1 + e_1^* = 347.98 - 19.88 = 328.1$.

When the values in column 6 are regressed against the X values in column 1, based
on simple linear regression model (2.1), we obtain $b_1^* = 3.7564$. In the same way, 999 other
bootstrap samples were selected and $b_1^*$ obtained for each. Figure 11.13 contains a histogram
of the 1,000 bootstrap $b_1^*$ estimates. Note that this bootstrap sampling distribution is fairly
symmetrical and appears to be close to a normal distribution. We also see in Figure 11.13
that the standard deviation of the 1,000 $b_1^*$ estimates is $s^*\{b_1^*\} = .3251$, which is quite close
to the analytical estimate $s\{b_1\} = .3470$.

FIGURE 11.13 Histogram of Bootstrap Estimates $b_1^*$: Toluca Company Example.
[Histogram; horizontal axis: Bootstrap $b_1^*$, from 2.40 to 4.80. Annotations:
$b_1^*(.025) = 2.940$, $s^*\{b_1^*\} = .3251$, $b_1^*(.975) = 4.211$.]

To obtain an approximate 95 percent confidence interval for $\beta_1$ by the bootstrap reflection
method, we note in Figure 11.13 that the 2.5th and 97.5th percentiles of the bootstrap
sampling distribution are $b_1^*(.025) = 2.940$ and $b_1^*(.975) = 4.211$, respectively. Using (11.58),
we obtain:

$$d_1 = 3.5702 - 2.940 = .630$$
$$d_2 = 4.211 - 3.5702 = .641$$

Finally, we use (11.59) to obtain the confidence limits $3.5702 + .630 = 4.20$ and
$3.5702 - .641 = 2.93$ so that the approximate 95 percent confidence interval for $\beta_1$ is:

$$2.93 \le \beta_1 \le 4.20$$

Note that these limits are quite close to the confidence limits 2.85 and 4.29 obtained by
analytical methods.

TABLE 11.9 Bootstrapping with Fixed X Sampling: Toluca Company Example.

                 Original Sample                  Bootstrap Sample 1
         (1)     (2)       (3)         (4)         (5)        (6)
   i    $X_i$   $Y_i$  $\hat{Y}_i$   $e_i$      $e_i^*$    $Y_i^*$
   1      80     399     347.98      51.02      -19.88      328.1
   2      30     121     169.47     -48.47       10.72      180.2
   3      50     221     240.88     -19.88       -6.68      234.2
  ...
  23      40     244     205.17      38.83        4.02      209.2
  24      80     342     347.98      -5.98      -45.17      302.8
  25      70     323     312.28      10.72       51.02      363.3
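The fixed X sampling steps used in this example (keep the X values and fitted values, resample the residuals with replacement, refit) can be sketched as below. This is an illustration only: the data are hypothetical values generated for the sketch, not the Toluca observations, and the function names are assumptions.

```python
import random
import statistics

def ols_fit(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def fixed_x_bootstrap(x, y, n_boot, rng):
    """Bootstrap slopes b1* under fixed X sampling: Y* = fitted value + resampled residual."""
    b0, b1 = ols_fit(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        e_star = rng.choices(residuals, k=len(x))        # residuals drawn with replacement
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
        slopes.append(ols_fit(x, y_star)[1])             # regress Y* on the original X
    return b1, slopes

# Hypothetical data (NOT the Toluca observations): 20 cases with constant error variance.
rng = random.Random(1)
x = [20, 30, 40, 50, 60, 70, 80, 90, 100, 110] * 2
y = [62.0 + 3.57 * xi + rng.gauss(0, 45) for xi in x]

b1, slopes = fixed_x_bootstrap(x, y, 1000, rng)
print(round(b1, 3), round(statistics.stdev(slopes), 3))  # estimate and bootstrap s*{b1*}
```

The standard deviation of `slopes` plays the role of $s^*\{b_1^*\}$, and the sorted values supply the percentiles needed for the reflection interval.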



Example 2: Blood Pressure

For the blood pressure example in Table 11.1, the analyst used weighted least squares
in order to recognize the unequal error variances and fitted a standard deviation function
to estimate the unknown weights. The standard inference procedures employed by the
analyst for estimating the precision of the estimated regression coefficient $b_{w1} = .59634$
and for obtaining a confidence interval for $\beta_1$ are therefore only approximate. To examine
whether the approximation is good here, we shall evaluate the precision of the estimated
regression coefficient in a way that recognizes the impreciseness of the weights by using
bootstrapping. The X variable (age) probably should be regarded as random and the error
variance varies with the level of X, so we shall use random X sampling. Table 11.10 repeats
from Table 11.1 the original data for age (X) and diastolic blood pressure (Y) in columns 1
and 2. Columns 3 and 4 contain the $(X_i^*, Y_i^*)$ observations for the first bootstrap sample
selected with replacement from columns 1 and 2. When we now regress $Y^*$ on $X^*$ by
ordinary least squares, we obtain the fitted regression function:

$$\hat{Y}^* = 50.384 + .7432X^*$$

The residuals for this fitted function are shown in column 5. When the absolute values of
these residuals are regressed on $X^*$, the fitted standard deviation function obtained is:

$$\hat{s}^* = -5.409 + .32745X^*$$

The fitted values $\hat{s}_i^*$ are shown in column 6. Finally, the weights $w_i^* = 1/(\hat{s}_i^*)^2$ are shown in
column 7. For example, $w_1^* = 1/(10.64)^2 = .0088$. Finally, $Y^*$ is regressed on $X^*$ by using
the weights in column 7, to yield the bootstrap estimate $b_1^* = .838$.

This process was repeated 1,000 times. The histogram of the 1,000 bootstrap values $b_1^*$
is shown in Figure 11.14 and appears to approximate a normal distribution. The standard
deviation of the 1,000 bootstrap values is shown in Figure 11.14; it is $s^*\{b_1^*\} = .0825$. When
we compare this precision with that obtained by the approximate use of (11.13), .0825 versus
.07924, we see that recognition of the use of estimated weights has led here only to a small
increase in the estimated standard deviation. Hence, the variability in $b_{w1}$ associated with
the use of estimated variances in the weights is not substantial and the standard inference
procedures therefore provide a good approximation here.
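One replicate of the random X procedure described above (resample cases, fit by ordinary least squares, regress the absolute residuals on X to get a standard deviation function, form weights, refit by weighted least squares) can be sketched as follows. The data are hypothetical values generated for illustration, not the blood pressure data, and the function names are assumptions. The floor on the fitted standard deviations is also an assumption added so that the weights stay finite; the text does not describe such a guard.

```python
import random
import statistics

def wls_fit(x, y, w):
    """Weighted least squares fit of y on x; returns (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def estimated_weight_slope(x, y):
    """Slope from weighted least squares with weights 1 / (fitted standard deviation)^2."""
    ones = [1.0] * len(x)
    b0, b1 = wls_fit(x, y, ones)                       # equal weights = ordinary least squares
    abs_res = [abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y)]
    c0, c1 = wls_fit(x, abs_res, ones)                 # standard deviation function
    floor = max(0.1 * statistics.fmean(abs_res), 1e-12)  # assumed guard against s_hat <= 0
    s_hat = [max(c0 + c1 * xi, floor) for xi in x]
    weights = [1.0 / s ** 2 for s in s_hat]
    return wls_fit(x, y, weights)[1]

def random_x_bootstrap(x, y, n_boot, rng):
    """Resample (X, Y) cases with replacement and repeat the whole estimation each time."""
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(estimated_weight_slope([x[i] for i in idx], [y[i] for i in idx]))
    return slopes

# Hypothetical data: 54 cases whose error standard deviation grows with X.
rng = random.Random(2)
age = [rng.randint(20, 59) for _ in range(54)]
dbp = [56.0 + 0.58 * a + rng.gauss(0, 0.15 * a) for a in age]

boot_slopes = random_x_bootstrap(age, dbp, 200, rng)
print(round(statistics.stdev(boot_slopes), 4))  # bootstrap estimate of precision
```

Because the weights are re-estimated inside every replicate, the spread of `boot_slopes` reflects the extra variability that comes from not knowing the weights.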

TABLE 11.10 Bootstrapping with Random X Sampling: Blood Pressure Example.

        Original Sample                   Bootstrap Sample 1
         (1)     (2)      (3)       (4)       (5)          (6)           (7)
   i    $X_i$   $Y_i$   $X_i^*$   $Y_i^*$   $e_i^*$   $\hat{s}_i^*$   $w_i^*$
   1      27      73       49       101      14.20        10.64         .0088
   2      21      66       34        73      -2.65         5.72         .0305
   3      22      63       49       101      14.20        10.64         .0088
  ...
  52      52     100       46        89       4.43         9.65         .0107
  53      58      80       27        73       2.55         3.43         .0850
  54      57     109       40        70     -10.11         7.69         .0169

FIGURE 11.14 [Histogram of the 1,000 bootstrap estimates $b_1^*$ for the blood pressure
example; horizontal axis: Bootstrap $b_1^*$, from 0.30 to 0.95. Annotations:
$b_1^*(.025) = .4375$, $s^*\{b_1^*\} = .0825$, $b_1^*(.975) = .7583$.]

A 95 percent bootstrap confidence interval for $\beta_1$ can be obtained from (11.59) by
using the percentiles $b_1^*(.025) = .4375$ and $b_1^*(.975) = .7583$ shown in Figure 11.14. The
approximate 95 percent confidence limits are [recall from (11.20) that $b_{w1} = .59634$]:

$$b_{w1} - d_2 = .59634 - (.7583 - .59634) = .4344$$
$$b_{w1} + d_1 = .59634 + (.59634 - .4375) = .7552$$

and the confidence interval for $\beta_1$ is:

$$.434 \le \beta_1 \le .755$$

Note that this confidence interval is almost the same as that obtained earlier by standard
inference procedures ($.437 \le \beta_1 \le .755$). This again confirms that it is appropriate to use
standard inference procedures here even though the weights were estimated.

Comment 

The reason why $d_1$ is associated with the upper confidence limit in (11.59) and $d_2$ with the lower
limit is that the upper $(1 - \alpha/2)100$ percentile in the sampling distribution of $b_1$ identifies the lower
confidence limit for $\beta_1$, whereas the lower $(\alpha/2)100$ percentile identifies the upper confidence limit.
To see this, consider the sampling distribution for $b_1$, for which we can state with probability $1 - \alpha$
that $b_1$ will fall between:

$$b_1(\alpha/2) \le b_1 \le b_1(1 - \alpha/2) \tag{11.60}$$

where $b_1(\alpha/2)$ and $b_1(1 - \alpha/2)$ denote the $(\alpha/2)100$ and $(1 - \alpha/2)100$ percentiles of the sampling
distribution of $b_1$. We now express these percentiles in terms of distances from the mean of the
sampling distribution, $E\{b_1\} = \beta_1$:

$$D_1 = \beta_1 - b_1(\alpha/2) \qquad D_2 = b_1(1 - \alpha/2) - \beta_1 \tag{11.61}$$

and obtain:

$$b_1(\alpha/2) = \beta_1 - D_1 \qquad b_1(1 - \alpha/2) = \beta_1 + D_2 \tag{11.62}$$

Substituting (11.62) into (11.60) and rearranging the inequalities so that $\beta_1$ is in the middle leads
to the limits:

$$b_1 - D_2 \le \beta_1 \le b_1 + D_1$$

The confidence interval in (11.59) is obtained by replacing $D_1$ and $D_2$ by $d_1$ and $d_2$, which involves
using the percentiles of the bootstrap sampling distribution as estimates of the corresponding
percentiles of the sampling distribution of $b_1$ and using $b_1$ as the estimate of the mean of the sampling
distribution.
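The rearrangement in this comment can be written out one inequality at a time. The following is a worked sketch of that step using only the symbols defined in (11.60) through (11.62); it adds no new assumptions.

```latex
\begin{align*}
\beta_1 - D_1 \le b_1 \le \beta_1 + D_2
  &\iff \beta_1 - D_1 \le b_1 \quad\text{and}\quad b_1 \le \beta_1 + D_2 \\
  &\iff \beta_1 \le b_1 + D_1 \quad\text{and}\quad b_1 - D_2 \le \beta_1 \\
  &\iff b_1 - D_2 \le \beta_1 \le b_1 + D_1 .
\end{align*}
```

The left-hand inequality, which involves the lower percentile distance $D_1$, becomes the upper limit, and the right-hand inequality, which involves $D_2$, becomes the lower limit.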


11.6 Case Example — MNDOT Traffic Estimation 

Traffic monitoring involves the collection of many types of data, such as traffic volume 
traffic composition, vehicle speeds, and vehicle weights. These data provide information for 
highway planning, engineering design, and traffic control, as well as for legislative decisions 
concerning budget allocation, selection of state highway routes, and the setting of speed 
limits. One of the most important traffic monitoring variables is the average annual daily 
traffic (AADT) for a section of road or highway. AADT is defined as the average, over a
year, of the number of vehicles that pass through a particular section of a road each day.
Information on AADT is often collected by means of automatic traffic recorders (ATRs). 
Since it is not possible to install these recorders on all state road segments because of 
the expense involved, Cheng (Ref. 11.19) investigated the use of regression analysis for 
estimating AADT for road sections that are not monitored in the state of Minnesota. 


The AADT Database 


Seven potential predictors of traffic volume were chosen from the Minnesota Department
of Transportation (MNDOT) road-log database, including type of road section, population
density in the vicinity of road section, number of lanes in road section, and road
section's width. Four of the seven variables were qualitative, requiring 19 indicator variables.
Preliminary regression analysis indicated that the large number of levels of two of the
qualitative variables was not helpful. Consequently, judgment and statistical information about
marginal reductions in the error sum of squares were used to collapse the categories, so
only 10 instead of 19 indicator variables remained in the AADT database.

The variables included in the initial analysis were as follows: 


CTYPOP ($X_1$): population of county in which road section is located (best proxy
available for population density in immediate vicinity of road section)

LANES ($X_2$): number of lanes in road section

WIDTH ($X_3$): width of road section (in feet)

CONTROL ($X_4$): two-category qualitative variable indicating whether or not there is
control of access to road section (1 = access control; 2 = no access control)

CLASS ($X_5$, $X_6$, $X_7$): four-category qualitative variable indicating road section
function (1 = rural interstate; 2 = rural noninterstate; 3 = urban interstate;
4 = urban noninterstate)

TRUCK ($X_8$, $X_9$, $X_{10}$, $X_{11}$): five-category qualitative variable indicating availability
status of road section to trucks (e.g., tonnage and time-of-year restrictions)



TABLE 11.11 Data: MNDOT Traffic Estimation Example.

                                                      Access     Function               Truck
                   County                             Control    Class                  Route                   Locale
  AADT             Population   Lanes      Width      Category   Category               Category                Category
  $Y_i$            $X_{i1}$     $X_{i2}$   $X_{i3}$   $X_{i4}$   ($X_{i5}$ to $X_{i7}$)   ($X_{i8}$ to $X_{i11}$)   ($X_{i12}$, $X_{i13}$)
  1,616              13,404        2          52         2          2                      5                       1
  1,329              52,314        2          60         2          2                      5                       1
  3,933              30,982        2          57         2          4                      5                       2
  [illegible],905   459,784        4          68         2          4                      5                       2
  15,408            459,784        2          40         2          4                      5                       3
  1,266              43,784        2          44         2          4                      5                       2

Source: C. Cheng, "Optimal Sampling for Traffic Volume Estimation," unpublished Ph.D. dissertation, University of Minnesota, Carlson School of Management, 1992.


LOCALE ($X_{12}$, $X_{13}$): three-category qualitative variable indicating type of locale
(1 = rural; 2 = urban, population < 50,000; 3 = urban, population > 50,000)

A portion of the data is shown in Table 11.11. Altogether, complete records for 121 ATRs 
were available. For conciseness, only the category is shown for a qualitative variable and 
not the coding of the indicator variables. 


Model Development 

A SYSTAT scatter plot matrix of the data set, with lowess fits added, is presented in
Figure 11.15. We see from the first row of the matrix that several of the predictor variables
are related to AADT. The lowess fits suggest a potentially curvilinear relationship between
LANES and AADT. Although the lowess fits of AADT to the qualitative categories
designated 1, 2, 3, etc., are meaningless, they do highlight the average traffic volume for each
category. For example, the lowess fit of AADT to CLASS shows that average AADT for
the third category of CLASS is higher than for the other three categories. The scatter plot
matrix also suggests that the variability of AADT may be increasing with some predictor
variables, for instance, with CTYPOP.

An initial regression fit of a first-order model with ordinary least squares, using all
predictor variables, indicated that CTYPOP and LANES are important variables. Regression
diagnostics for this initial fit suggested two potential problems. First, the residual plot
against predicted values revealed that the error variance might not be constant. Also, the
maximum variance inflation factor (10.41) was 24.55, suggesting a severe degree of
multicollinearity. The maximum Cook's distance measure (10.33) was .2076, indicating that
none of the individual cases is particularly influential. Since many of the variables appeared
to be unimportant, we next considered the use of subset selection procedures to identify
promising initial models.

FIGURE 11.15 SYSTAT Scatter Plot Matrix: MNDOT Traffic Estimation Example.
[Scatter plot matrix with lowess fits; legible panel labels: AADT, CTYPOP, TRUCKRT, LOCALE.]

The SAS all-possible-regressions procedure, PROC RSQUARE, was used for subset
selection. To reduce the volume of computation, CTYPOP and LANES were forced to be
included. The SAS output is given in Figure 11.16. The left column indicates the number
of X variables in the model, i.e., $p - 1$. The names of the qualitative variables identify the
predictor variable and the category for which the indicator variable is coded 1. For example,
CLASS1 refers to the first indicator variable for the predictor variable CLASS; i.e., it refers
to $X_5$, which is coded 1 for category 1 (rural interstate). Two simple models look particularly
promising. The three-variable model consisting of $X_1$ (CTYPOP), $X_2$ (LANES), and $X_7$
(CLASS = 3) stands out as the best three-variable model, with $R_p^2 = .805$ and $C_p = 5.23$.
Since $p = 4$ for this model, the $C_p$ statistic suggests that this model contains little bias.
The best four-variable model includes $X_1$ (CTYPOP), $X_2$ (LANES), $X_4$ (CONTROL = 1),
and $X_5$ (CLASS = 1). With this model, some improvements in the selection criteria are
realized: $R_p^2 = .812$ and $C_p = 2.65$. On the basis of these results, it was decided to investigate
a model based on the five predictor variables included in these two models: $X_1$ (CTYPOP),
$X_2$ (LANES), $X_4$ (CONTROL = 1), $X_5$ (CLASS = 1), and $X_7$ (CLASS = 3). Note that
because $X_6$ (CLASS = 2) has been dropped from further consideration, the rural noninterstate
(CLASS = 2) and urban noninterstate (CLASS = 4) categories of the CLASS variable have
been collapsed into one category.

FIGURE 11.16 SAS All-Possible-Regressions Output: MNDOT Traffic Estimation Example.

N = 121     Regression Models for Dependent Variable: AADT

In   R-square    C(p)       Variables in Model
 2   0.694589    69.7231    CTYPOP LANES
NOTE: The above variables are included in all models to follow
 3   0.804522     5.2315    CLASS3
 3   0.751353    37.3903    CONTROL1
 3   0.725755    52.8725    TRUCK1
 3   0.704495    65.7318    LOCALE2
 3   0.704250    65.8798    CLASS1
 4   0.812099     2.6490    CONTROL1 CLASS1
 4   0.810364     3.6986    CLASS3 LOCALE2
 4   0.808001     5.1275    CLASS3 LOCALE1
 4   0.807122     5.6590    CLASS2 CLASS3
 4   0.806300     6.1562    CLASS3 TRUCK4
 5   0.816245     2.1414    CONTROL1 CLASS1 LOCALE2
 5   0.815842     2.3848    CONTROL1 CLASS1 LOCALE1
 5   0.814362     3.2803    CONTROL1 CLASS1 CLASS2
 5   0.813901     3.5589    CONTROL1 CLASS1 TRUCK4
 5   0.812788     4.2321    CONTROL1 CLASS1 TRUCK2
 6   0.818304     2.8958    WIDTH CONTROL1 CLASS1 LOCALE1
 6   0.817992     3.0845    CONTROL1 CLASS1 TRUCK4 LOCALE2
 6   0.817915     3.1309    CONTROL1 CLASS1 TRUCK2 LOCALE2
 6   0.817741     3.2367    CONTROL1 CLASS1 TRUCK2 LOCALE1
 6   0.817738     3.2383    WIDTH CONTROL1 CLASS1 LOCALE2
 7   0.820443     3.6023    WIDTH CONTROL1 CLASS1 TRUCK4 LOCALE1
 7   0.819942     3.9050    WIDTH CONTROL1 CLASS1 TRUCK4 LOCALE2
 7   0.819473     4.1891    WIDTH CONTROL1 CLASS1 TRUCK2 LOCALE1
 7   0.819180     4.3663    CONTROL1 CLASS1 TRUCK2 TRUCK4 LOCALE2
 7   0.819007     4.4705    WIDTH CONTROL1 CLASS1 CLASS2 LOCALE1
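The $C_p$ criterion used in this subset comparison is computed as $C_p = SSE_p / MSE(\text{full}) - (n - 2p)$, where $p$ counts the intercept and the mean square error comes from the model containing all potential X variables. A minimal sketch follows. The sums of squares are hypothetical numbers chosen for illustration (the SAS output reports only $R^2$ and $C_p$), and the function name is an assumption; the counts $n = 121$ and 13 potential X variables are taken from the example.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' C_p = SSE_p / MSE(full) - (n - 2p); p includes the intercept."""
    return sse_p / mse_full - (n - 2 * p)

n = 121                  # number of ATR records in the example
p_full = 14              # 13 potential X variables plus the intercept
sse_full = 1000.0        # hypothetical error sum of squares for the full model
mse_full = sse_full / (n - p_full)

# For the full model, C_p always equals p.
cp_full = mallows_cp(sse_full, mse_full, n, p_full)

# Hypothetical three-variable subset (p = 4) with a somewhat larger SSE.
cp_subset = mallows_cp(1100.0, mse_full, n, 4)

print(round(cp_full, 2), round(cp_subset, 2))
```

A subset whose $C_p$ is close to its own $p$ shows little evidence of bias, which is the reading applied to the three-variable model above.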

Figure 11.17a contains a plot of the studentized residuals against the fitted values for the
five-variable model. The plot reveals two potential problems: (1) The residuals tend to be
positive for small and large values of $\hat{Y}$ and negative for intermediate values, suggesting a
curvilinearity in the response function. (2) The variability of the residuals tends to increase
with increasing $\hat{Y}$, indicating nonconstancy of the error variance.










Curvilinearity was investigated next, together with possible interaction effects. A squared
term for each of the two quantitative variables (CTYPOP and LANES) was added to the
pool of potential X variables. To reduce potential multicollinearity problems, each of these
variables was first centered. In addition, nine cross-product terms were added to the pool
of potential X variables, consisting of the cross products of the X variables for the four
predictor variables.
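The benefit of centering before squaring can be seen with a small numerical sketch. The lane counts below are hypothetical values chosen for illustration, not the MNDOT data, and the helper name is an assumption.

```python
import statistics

def correlation(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    u_bar, v_bar = statistics.fmean(u), statistics.fmean(v)
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

lanes = [2, 2, 2, 2, 4, 4, 4, 6, 6, 8]            # hypothetical lane counts
mean_lanes = statistics.fmean(lanes)
centered = [x - mean_lanes for x in lanes]

raw_corr = correlation(lanes, [x ** 2 for x in lanes])             # X versus X^2
centered_corr = correlation(centered, [c ** 2 for c in centered])  # x versus x^2, x = X - mean

print(round(raw_corr, 3), round(centered_corr, 3))
```

With these values the correlation between the variable and its square drops from about .98 to about .55 after centering. The reduction is complete only when the values are symmetric about their mean.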

The SAS all-possible-regressions procedure was run again for this enlarged pool of
potential X variables (output not shown). Analysis of the results suggested a model with five
X variables: CTYPOP, LANES, LANES$^2$, CONTROL1, and CTYPOP × CONTROL1. For
this model, $R_p^2$ is .925, and all P-values for the regression coefficients are 0+. Although this
model does not have the largest $R_p^2$ value among five-term models, it is desirable because it is
easy to interpret and does not differ substantially from other models favorably identified by
the $C_p$ or $R_p^2$ criteria. A plot of the studentized residuals against $\hat{Y}$, shown in Figure 11.17b,
indicates that curvilinearity is no longer present. Also, neither Cook's distance measure
(maximum = .47) nor the variance inflation factors (maximum = 2.5) revealed serious
problems at this stage. Nonconstancy of the error term variance has persisted, however, as
confirmed by the Breusch-Pagan test.
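For readers who want to see the mechanics of the Breusch-Pagan test, here is a minimal sketch for the simplest case of a single predictor, where the test statistic is $(SSR^*/2) \div (SSE/n)^2$ and is referred to a chi-square distribution with 1 degree of freedom. The case study involves several predictors, so this is a simplified illustration rather than the analysis actually performed. The data are hypothetical, and the function names are assumptions.

```python
import math
import random
import statistics

def ols_fit(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def breusch_pagan(x, y):
    """Breusch-Pagan statistic and P-value for one predictor (chi-square, 1 df)."""
    n = len(x)
    b0, b1 = ols_fit(x, y)
    sq_res = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
    sse = sum(sq_res)                                  # SSE of the original fit
    g0, g1 = ols_fit(x, sq_res)                        # regress squared residuals on X
    mean_sq = statistics.fmean(sq_res)
    ssr_star = sum((g0 + g1 * xi - mean_sq) ** 2 for xi in x)
    stat = (ssr_star / 2) / (sse / n) ** 2
    p_value = math.erfc(math.sqrt(stat / 2))           # P(chi-square with 1 df > stat)
    return stat, p_value

# Hypothetical data whose error standard deviation grows with X.
rng = random.Random(4)
x = list(range(1, 41))
y = [10.0 + 2.0 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

stat, p_value = breusch_pagan(x, y)
print(round(stat, 2), round(p_value, 4))
```

A large statistic (small P-value) leads to the conclusion that the error variance is not constant.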

Weighted Least Squares Estimation 

To remedy the problem with nonconstancy of the error term variance, weighted least squares
was implemented by developing a standard deviation function. Residual plots indicated that
the absolute residuals vary with CTYPOP and LANES. A fit of a first-order model where
the absolute residuals are regressed on CTYPOP and LANES yielded an estimated standard
deviation function for which $R^2 = .386$ and the P-values for the regression coefficients for
CTYPOP and LANES are .001 and 0+. Note that, as is often the case, the $R^2$ value for
the estimated standard deviation function (.386) is substantially smaller than that for the
estimated response function (.925).

FIGURE 11.17 Plots of Studentized Residuals versus Fitted Values: MNDOT Traffic Estimation Example.
(a) Ordinary Least Squares Fit of Initial (First-Order) Model.
(b) Ordinary Least Squares Fit of Final (Second-Order) Model.
[Each panel plots the studentized residual (vertical axis) against the fitted value (horizontal axis).]

FIGURE 11.18 MINITAB Weighted Least Squares Regression Results: MNDOT Traffic Estimation Example.

The regression equation is
AADT = 9602 + 0.0146 CTYPOP + 6162 LANES + 16556 CONTROL1 + 2250 LANES2
       + 0.0637 POPXCTL1

Predictor        Coef       Stdev    t-ratio        p
Constant         9602        1432       6.71    0.000
CTYPOP       0.014567    0.003047       4.78    0.000
LANES          6161.8       933.9       6.60    0.000
CONTROL1        16556        2966       5.58    0.000
LANES2         2249.7       755.8       2.98    0.004
POPXCTL1     0.063696    0.008421       7.56    0.000

Analysis of Variance

SOURCE         DF         SS        MS        F        p
Regression      5     919.55    183.91    93.13    0.000
Error         115     227.10      1.97
Total         120    1146.65

Using the weights obtained from the standard deviation function, weighted least squares
estimates of the regression coefficients were obtained. Since some of the estimated
regression coefficients differed substantially from those obtained with unweighted least squares,
the residuals from the weighted least squares fit were used to reestimate the standard
deviation function, and revised weights were obtained. Two more iterations of this iteratively
reweighted least squares process led to stable estimated coefficients.
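The iteratively reweighted least squares cycle described above can be sketched for a single predictor as follows. This is an illustration under stated assumptions, not the analysis performed for the case study: the data are hypothetical, the function names are invented for the sketch, the standard deviation function is refit without weights at each pass, and a floor on the fitted standard deviations is added so that the weights stay finite.

```python
import random
import statistics

def wls_fit(x, y, w):
    """Weighted least squares fit of y on x; returns (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def iteratively_reweighted_fit(x, y, n_iter=3):
    """Return [(b0, b1), ...]: the unweighted fit followed by n_iter reweighted fits."""
    ones = [1.0] * len(x)
    weights = ones
    history = []
    for _ in range(n_iter + 1):
        b0, b1 = wls_fit(x, y, weights)
        history.append((b0, b1))
        # Re-estimate the standard deviation function from the current residuals.
        abs_res = [abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y)]
        c0, c1 = wls_fit(x, abs_res, ones)
        floor = max(0.1 * statistics.fmean(abs_res), 1e-12)   # assumed guard
        weights = [1.0 / max(c0 + c1 * xi, floor) ** 2 for xi in x]
    return history

# Hypothetical data whose error standard deviation grows with X.
rng = random.Random(3)
x = [rng.uniform(1, 10) for _ in range(60)]
y = [5.0 + 2.0 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

history = iteratively_reweighted_fit(x, y, n_iter=3)
for b0, b1 in history:
    print(round(b0, 3), round(b1, 3))   # coefficients should settle after a few passes
```

Comparing successive rows of the output shows whether the coefficients have stabilized, which is the stopping condition used in the text.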

MINITAB regression results for the weighted least squares fit based on the final weights 
are shown in Figure 11.18. Note that the signs of the regression coefficients are all positive, 
as might be expected: 

CTYPOP: Traffic increases with local population density

LANES: Traffic increases with number of lanes

CONTROL1: Traffic is highest for road sections under access control

LANES$^2$: An upward-curving parabola is consistent with the shape of the lowess fit of
AADT to LANES in Figure 11.15

CTYPOP × CONTROL1: Traffic increase with access control is more pronounced for
higher population density

Figure 11.19a contains a plot of the studentized residuals against the fitted values, and
Figure 11.19b contains a normal probability plot of the studentized residuals. Notice that
the variability of the studentized residuals is now approximately constant. While the
normal probability plot in Figure 11.19b indicates some departure from normality (this was
confirmed by the correlation test for normality), the departure does not appear to be serious,
particularly in view of the large sample size.

FIGURE 11.19 Residual Plots for Final Weighted Least Squares Regression Fit: MNDOT Traffic Estimation Example.
(a) Residual Plot against $\hat{Y}$ [horizontal axis: Fitted Value].
(b) Normal Probability Plot [horizontal axis: Expected, from -3 to 3].

TABLE 11.12 95 Percent Approximate Confidence Limits for Mean Responses: MNDOT Traffic Estimation Example.

                (1)        (2)       (3)         (4)             (5)            (6)        (7)
  Road                                                                         Confidence Limits
  Section     CTYPOP     LANES    CONTROL1   $\hat{Y}_h$   $s\{\hat{Y}_h\}$    Lower      Upper
  Rural       113,571      2         0          3,365           354           2,663      4,066
  Suburban    222,229      4         0         16,379         1,827          12,758     19,999
  Urban       941,411      6         1        116,024         6,597         102,953    129,095

To assess the usefulness of the model for estimating AADT, approximate 95 percent
confidence intervals for mean traffic for typical rural, suburban, and urban road sections
were constructed. The levels of the predictor variables for these road sections are given in
Table 11.12, columns 1-3. The estimated mean traffic is given in column 4. The
approximate estimated standard deviations of the estimated mean responses for each of these road
sections, shown in column 5, were obtained by using $s^2\{\mathbf{b}_w\}$ from (11.13) in (6.58):

$$s^2\{\hat{Y}_h\} = \mathbf{X}_h' s^2\{\mathbf{b}_w\} \mathbf{X}_h = MSE_w \mathbf{X}_h' (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} \mathbf{X}_h \tag{11.63}$$

where the vector $\mathbf{X}_h$ is defined in (6.53). Since the estimated standard deviations in column 5
are only approximations because the least squares weights were estimated by means of
a standard deviation function, bootstrapping with random X sampling was employed to
assess the precision of the fitted values. The standard deviations of the bootstrap sampling
distributions were close to the estimated standard deviations in column 5. The consistency
of the results shows that the iterative estimation of the weights by means of the standard
deviation function did not have any substantial effect here on the precision of the fitted
values.

The approximate 95 percent confidence limits for $E\{Y_h\}$, computed using (6.59), are
presented in columns 6 and 7 of Table 11.12. The precision of these estimates was considered to
be sufficient for planning purposes. However, because the suburban and rural road estimates
have the poorest relative precision, it was recommended that better records be developed
for population density in the immediate vicinity of a road section, since county population
does not always reflect local population density. The improved information could lead to a
better regression model, with more precise estimates for road sections in rural and suburban
settings.
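The limits in Table 11.12 and the remark about relative precision can be checked with a short sketch. The estimates and standard deviations are copied from columns 4 and 5 of the table. The multiplier 1.981 is an approximate value of $t(.975; 115)$ supplied here as an assumption (the error degrees of freedom, 115, come from Figure 11.18), so the computed limits agree with the table only up to rounding.

```python
T_MULTIPLIER = 1.981   # approximate t(.975; 115); an assumed, rounded value

# (estimated mean response, estimated standard deviation) from Table 11.12, columns 4 and 5
sections = {
    "Rural": (3365, 354),
    "Suburban": (16379, 1827),
    "Urban": (116024, 6597),
}

def mean_response_limits(y_hat, s_y_hat, t_mult=T_MULTIPLIER):
    """Confidence limits for a mean response: estimate plus or minus t times its std. deviation."""
    half_width = t_mult * s_y_hat
    return y_hat - half_width, y_hat + half_width

limits = {name: mean_response_limits(*vals) for name, vals in sections.items()}
relative_half_width = {
    name: T_MULTIPLIER * s / y_hat for name, (y_hat, s) in sections.items()
}

for name in sections:
    lo, hi = limits[name]
    print(name, round(lo), round(hi), round(relative_half_width[name], 3))
```

The half-width is roughly 21 to 22 percent of the estimate for the rural and suburban sections but only about 11 percent for the urban section, which is the basis for the statement that the rural and suburban estimates have the poorest relative precision.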

The approach for developing the regression model described here is not, of course, the 
only approach that can lead to a useful regression model, nor is the analysis complete as 
described. For example, the residual plot in Figure 11.19a suggests the presence of at least 
one outlier {r ^2 = 5.02). Possible remedial measures for this case should be considered. 
In addition, the departure from normality might be remedied by a transformation of the 
response variable. This transformation might also stabilize the variance of the error terms 
sufficiently so that weighted least squares would not be needed. In fact, subsequent analysis 
using the Box-Cox transformation approach found that a cube root transformation of the 
response is very effective in this instance. A final choice between the model fit obtained by 
weighted least squares and a model fit developed by an alternative approach can be made 
on the basis of model validation studies. 


Cited References

11.1. Davidian, M., and R. J. Carroll. "Variance Function Estimation," Journal of the American
Statistical Association 82 (1987), pp. 1079-91.

11.2. Greene, W. H. Econometric Analysis, 5th ed. Upper Saddle River, New Jersey: Prentice Hall,
2003.

11.3. Belsley, D. A. Conditioning Diagnostics: Collinearity and Weak Data in Regression. New
York: John Wiley & Sons, 1991.

11.4. Frank, I. E., and J. H. Friedman. "A Statistical View of Some Chemometrics Regression Tools,"
Technometrics 35 (1993), pp. 109-35.

11.5. Hoaglin, D. C.; F. Mosteller; and J. W. Tukey. Exploring Data Tables, Trends, and Shapes.
New York: John Wiley & Sons, 1985.

11.6. Rousseeuw, P. J., and A. M. Leroy. Robust Regression and Outlier Detection. New York: John
Wiley & Sons, 1987.

11.7. Kennedy, W. J., Jr., and J. E. Gentle. Statistical Computing. New York: Marcel Dekker, 1980.

11.8. ETS Policy Information Center. America's Smallest School: The Family. Princeton, N.J.:
Educational Testing Service, 1992.

11.9. Haerdle, W. Applied Nonparametric Regression. Cambridge: Cambridge University Press,
1992.

11.10. Cleveland, W. S., and S. J. Devlin. "Locally Weighted Regression: An Approach to Regression
Analysis by Local Fitting," Journal of the American Statistical Association 83 (1988), pp. 596-610.

11.11. Breiman, L.; J. H. Friedman; R. A. Olshen; and C. J. Stone. Classification and Regression
Trees. Belmont, Calif.: Wadsworth, 1984.

11.12. Friedman, J. H., and W. Stuetzle. "Projection Pursuit Regression," Journal of the American
Statistical Association 76 (1981), pp. 817-23.

11.13. Eubank, R. L. Spline Smoothing and Nonparametric Regression, 2nd ed. New York: Marcel
Dekker, 1999.

11.14. Hastie, T., and C. Loader. "Local Regression: Automatic Kernel Carpentry" (with discussion),
Statistical Science 8 (1993), pp. 120-43.

11.15. Hastie, T.; R. Tibshirani; and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer-Verlag, 2001.

11.16. Efron, B. The Jackknife, The Bootstrap, and Other Resampling Plans. Philadelphia, Penn.:
Society for Industrial and Applied Mathematics, 1982.

11.17. Efron, B., and R. Tibshirani. "Bootstrap Methods for Standard Errors, Confidence Intervals,
and Other Measures of Statistical Accuracy," Statistical Science 1 (1986), pp. 54-77.

11.18. Efron, B. "Better Bootstrap Confidence Intervals" (with discussion), Journal of the American
Statistical Association 82 (1987), pp. 171-200.

11.19. Cheng, C. "Optimal Sampling for Traffic Volume Estimation," unpublished Ph.D. dissertation,
University of Minnesota, Carlson School of Management, 1992.


Problems 


11.1. One student remarked to another: "Your residuals show that nonconstancy of error variance
is clearly present. Therefore, your regression results are completely invalid." Comment.

11.2. An analyst suggested: "One nice thing about robust regression is that you need not worry
about outliers and influential observations." Comment.

11.3. Lowess smoothing becomes difficult when there are many predictors and the sample size is
small. This is sometimes referred to as the "curse of dimensionality." Discuss the nature of
this problem.

11.4. Regression trees become difficult to utilize when there are many predictors and the sample
size is small. Discuss the nature of this problem.

11.5. Describe how bootstrapping might be used to obtain confidence intervals for regression
coefficients when ridge regression is employed.

11.6. Computer-assisted learning. Data from a study of computer-assisted learning by 12 students,
showing the total number of responses in completing a lesson (X) and the cost of computer
time (Y, in cents), follow.

      i:    1   2   3   4   5   6   7   8   9  10  11  12
  $X_i$:   16  14  22  10  14  17  10  13  19  12  18  11
  $Y_i$:   77  70  85  50  62  70  55  63  88  57  81  51

a. Fit a linear regression function by ordinary least squares, obtain the residuals, and plot the
residuals against X. What does the residual plot suggest?

b. Divide the cases into two groups, placing the six cases with the smallest fitted values
into group 1 and the other six cases into group 2. Conduct the Brown-Forsythe test for
constancy of the error variance, using $\alpha = .05$. State the decision rule and conclusion.

c. Plot the absolute values of the residuals against X. What does this plot suggest about the
relation between the standard deviation of the error term and X?

d. Estimate the standard deviation function by regressing the absolute values of the residuals
against X, and then calculate the estimated weight for each case using (11.16a). Which
case receives the largest weight? Which case receives the smallest weight?

e. Using the estimated weights, obtain the weighted least squares estimates of $\beta_0$ and $\beta_1$. Are
these estimates similar to the ones obtained with ordinary least squares in part (a)?

f. Compare the estimated standard deviations of the weighted least squares estimates $b_{w0}$
and $b_{w1}$ in part (e) with those for the ordinary least squares estimates in part (a). What do
you find?

g. Iterate the steps in parts (d) and (e) one more time. Is there a substantial change in the
estimated regression coefficients? If so, what should you do?






 1     58.8    3
 2     34.8    1
 3    163.7    3
...
63     40.0    2      .44     0
64     60.5    3     2.10     0
65    104.?    3    19.81    24


a. Create two indicator variables for highest degree attained:


Degree        X_1   X_2
Bachelor's     0     0
Master's       1     0
Doctoral       0     1


*11.7. Machine speed. The number of defective items produced by a machine (Y) is known to be
linearly related to the speed setting of the machine (X). The data below were collected from
recent quality control records.

  i:     1    2    3    4    5    6    7    8    9   10   11   12
X_i:   200  400  300  400  200  300  300  400  200  400  200  300
Y_i:    28   75   37   53   22   58   40   96   46   52   30   69

a. Fit a linear regression function by ordinary least squares, obtain the residuals, and plot the
residuals against X. What does the residual plot suggest?

b. Conduct the Breusch-Pagan test for constancy of the error variance, assuming
$\log_e \sigma_i^2 = \gamma_0 + \gamma_1 X_i$; use $\alpha = .10$. State the alternatives, decision rule, and conclusion.

c. Plot the squared residuals against X. What does the plot suggest about the relation between
the variance of the error term and X?

d. Estimate the variance function by regressing the squared residuals against X, and then
calculate the estimated weight for each case using (11.16b).

e. Using the estimated weights, obtain the weighted least squares estimates of $\beta_0$ and $\beta_1$.
Are the weighted least squares estimates similar to the ones obtained with ordinary least
squares in part (a)?

f. Compare the estimated standard deviations of the weighted least squares estimates $b_{w0}$ and
$b_{w1}$ in part (e) with those for the ordinary least squares estimates in part (a). What do you
find?

g. Iterate the steps in parts (d) and (e) one more time. Is there a substantial change in the
estimated regression coefficients? If so, what should you do?

11.8. Employee salaries. A group of high-technology companies agreed to share employee salary
information in an effort to establish salary ranges for technical positions in research and
development. Data obtained for each employee included current salary (Y), a coded variable
indicating highest academic degree obtained (1 = bachelor's degree, 2 = master's degree,
3 = doctoral degree), years of experience since last degree ($X_3$), and the number of persons
currently supervised ($X_4$). The data follow.

Employee
   i      Y_i   Degree   X_i3   X_i4


o o 42 


9 2 4 

■ 4 - 9 - 5 : 
4 - 2 - 9 -“ 






b. Regress Y on $X_1$, $X_2$, $X_3$, and $X_4$, using a first-order model and ordinary least squares,
obtain the residuals, and plot them against $\hat{Y}$. What does the residual plot suggest?

c. Divide the cases into two groups, placing the 33 cases with the smallest fitted values
into group 1 and the other 32 cases into group 2. Conduct the Brown-Forsythe test for
constancy of the error variance, using $\alpha = .01$. State the decision rule and conclusion.

d. Plot the absolute residuals against $X_3$ and against $X_4$. What do these plots suggest about
the relation between the standard deviation of the error term and $X_3$ and $X_4$?

e. Estimate the standard deviation function by regressing the absolute residuals against
$X_3$ and $X_4$ in first-order form, and then calculate the estimated weight for each case
using (11.16a).

f. Using the estimated weights, obtain the weighted least squares fit of the regression model.
Are the weighted least squares estimates of the regression coefficients similar to the ones
obtained with ordinary least squares in part (b)?

g. Compare the estimated standard deviations of the weighted least squares coefficient
estimates in part (f) with those for the ordinary least squares estimates in part (b). What do
you find?

h. Iterate the steps in parts (e) and (f) one more time. Is there a substantial change in the
estimated regression coefficients? If so, what should you do?

11.9. Refer to Cosmetics sales Problem 10.13. Given below are the estimated ridge standardized
regression coefficients, the variance inflation factors, and $R^2$ for selected biasing constants c.


  c     b_1^R   b_2^R   b_3^R   (VIF)_1   (VIF)_2   (VIF)_3     R^2
 .00    .490    .296    .169     20.07     20.72      1.22     .7417
 .01    .461    .322    .167     10.36     10.67      1.17     .7416
 .02    .443    .336    .167      6.37      6.55      1.14     .7145
 .04    .463    .349    .166      3.20      3.27      1.08     .7412
 .06    .410    .354    .165      1.98      2.07      1.02     .7409
 .08    .401    .356    .164      1.38      1.40       .98     .7045
 .09    .398    .356    .164      1.20      1.21       .95     .7402
 .10    .394    .356    .164      1.05      1.06       .93     .7399


a. Make a ridge trace plot for the given c values. Do the ridge regression coefficients exhibit
substantial changes near c = 0?

b. Suggest a reasonable value for the biasing constant c based on the ridge trace, the VIF
values, and $R^2$.

c. Transform the estimated standardized regression coefficients selected in part (b) back to
the original variables and obtain the fitted values for the 44 cases. How similar are these
fitted values to those obtained with the ordinary least squares fit in Problem 10.13a?

*11.10. Chemical shipment. The data to follow, taken on 20 incoming shipments of chemicals
in drums arriving at a warehouse, show number of drums in shipment ($X_1$), total weight
of shipment ($X_2$, in hundred pounds), and number of minutes required to handle
shipment (Y).

   i:       1      2      3  ...     18     19     20
X_i1:       7     18      5  ...     21      6     11
X_i2:    5.11  16.72   3.20  ...  15.21   3.64   9.57
 Y_i:      58    152     41  ...    155     39     90







Given below are the estimated ridge standardized regression coefficients, the variance inflation
factors, and $R^2$ for selected biasing constants c.


   c     b_1^R    b_2^R    (VIF)_1 = (VIF)_2     R^2
 .000    .451     .561          7.03            .9869
 .005    .453     .556          6.20            .9869
 .01     .455     .552          5.51            .9869
 .05     .460     .526          2.65            .9862
 .07     .460     .517          2.03            .9856
 .09     .459     .508          1.61            .9852
 .10     .458     .504          1.46            .9844
 .20     .444     .473           .71            .9780


a. Fit regression model (6.1) to the data and find the fitted values.

b. Make a ridge trace plot for the given c values. Do the ridge regression coefficients exhibit
substantial changes near c = 0?

c. Why are the $(VIF)_1$ values the same as the $(VIF)_2$ values here?

d. Suggest a reasonable value for the biasing constant c based on the ridge trace, the VIF
values, and $R^2$.

e. Transform the estimated standardized regression coefficients selected in part (d) back to
the original variables and obtain the fitted values for the 20 cases. How similar are these
fitted values to those obtained with the ordinary least squares fit in part (a)?

*11.11. Refer to Copier maintenance Problem 1.20. Two cases had been held out of the original data
set because special circumstances led to unusually long service times:


Case
  i    X_i    Y_i
 46     6     132
 47     5     166

a. Using the enlarged (47-case) data set, fit a simple linear regression model using ordinary
least squares and plot the data together with the fitted regression function. What is the
effect of adding cases 46 and 47 on the fitted response function?

b. Obtain the scaled residuals in (11.47) and use the Huber weight function (11.44) to obtain
the case weights for a first iteration of IRLS robust regression. Which cases receive the
smallest Huber weights? Why?

c. Using the weights calculated in part (b), obtain the weighted least squares estimates of the
regression coefficients. How do these estimates compare to those found in part (a) using
ordinary least squares?

d. Continue the IRLS procedure for two more iterations. Which cases receive the smallest
weights in the final iteration? How do the final IRLS robust regression estimates compare
to the ordinary least squares estimates obtained in part (a)?

e. Plot the final IRLS estimated regression function, obtained in part (d), on the graph
constructed in part (a). Does the robust fit differ substantially from the ordinary least squares
fit? If so, which fit is preferred here?

11.12. Weight and height. The weights and heights of twenty male students in a freshman class are
recorded in order to see how well weight (Y, in pounds) can be predicted from height (X, in
inches). The data are given below. Assume that first-order regression (1.1) is appropriate.

  i:     1    2    3  ...   18   19   20
X_i:    74   65   72  ...   69   68   67
Y_i:   185  195  216  ...  177  145  137






a. Fit a simple linear regression model using ordinary least squares, and plot the data together
with the fitted regression function. Also, obtain an index plot of Cook's distance (10.33).
What do these plots suggest?

b. Obtain the scaled residuals in (11.47) and use the Huber weight function (11.44) to obtain
case weights for a first iteration of IRLS robust regression. Which cases receive the smallest
Huber weights? Why?

c. Using the weights calculated in part (b), obtain the weighted least squares estimates of the
regression coefficients. How do these estimates compare to those found in part (a) using
ordinary least squares?

d. Continue the IRLS procedure for two more iterations. Which cases receive the smallest
weights in the final iteration? How do the final IRLS robust regression estimates compare
to the ordinary least squares estimates obtained in part (a)?


Exercises 


11.13. (Calculus needed.) Derive the weighted least squares normal equations for fitting a simple
linear regression function when $\sigma_i^2 = kX_i$, where k is a proportionality constant.

11.14. Express the weighted least squares estimator $b_{w1}$ in (11.26a) in terms of the centered variables
$Y_i - \bar{Y}_w$ and $X_i - \bar{X}_w$, where $\bar{Y}_w$ and $\bar{X}_w$ are the weighted means.

11.15. Refer to Computer-assisted learning Problem 11.6. Demonstrate numerically that the 
weighted least squares estimates obtained in part (e) are identical to those obtained using 
transformation (11.23) and ordinary least squares. 

11.16. Refer to Machine speed Problem 11.7. Demonstrate numerically that the weighted least
squares estimates obtained in part (e) are identical to those obtained when using
transformation (11.23) and ordinary least squares.

11.17. Consider the weighted least squares criterion (11.6) with weights given by $w_i = 3/X_i$. Set up
the variance-covariance matrix for the error terms when $i = 1, \ldots, 4$. Assume
$\sigma\{\varepsilon_i, \varepsilon_j\} = 0$ for $i \neq j$.

11.18. Derive the variance-covariance matrix $\sigma^2\{\mathbf{b}_w\}$ in (11.10) for the weighted least squares
estimators when the variance-covariance matrix of the observations $Y_i$ is $k\mathbf{W}^{-1}$, where $\mathbf{W}$ is given
in (11.7) and k is a proportionality constant.

11.19. Derive the mean squared error in (11.29).

11.20. Refer to the body fat example of Table 7.1. Employing least absolute residuals regression, the
LAR estimates of the regression coefficients are $b_0 = -17.027$, $b_1 = .4173$, and $b_2 = .5203$.

a. Find the sum of the absolute residuals based on the LAR fit.

b. For the least squares estimated regression coefficients $b_0 = -19.174$, $b_1 = .2224$, and
$b_2 = .6594$, find the sum of the absolute residuals. Is this sum larger than the sum obtained
in part (a)? Is this to be expected?


Projects


11.21. Observations on Y are to be taken when X = 10, 20, 30, 40, and 50, respectively. The true
regression function is $E\{Y\} = 20 + 10X$. The error terms are independent and normally
distributed, with $E\{\varepsilon_i\} = 0$ and $\sigma^2\{\varepsilon_i\} = .8X_i$.

a. Generate a random Y observation for each X level and calculate both the ordinary and
weighted least squares estimates of the regression coefficient $\beta_1$ in the simple linear
regression function.

b. Repeat part (a) 200 times, generating new random numbers each time.





c. Calculate the mean and variance of the 200 ordinary least squares estimates of $\beta_1$ and do
the same for the 200 weighted least squares estimates.

d. Do both the ordinary least squares and weighted least squares estimators appear to be 
unbiased? Explain. Which estimator appears to be more precise here? Comment. 

11.22. Refer to Patient satisfaction Problem 6.15. 

a. Obtain the estimated ridge standardized regression coefficients, variance inflation factors,
and $R^2$ for the following biasing constants: c = .000, .005, .01, .02, .03, .04, .05.

b. Make a ridge trace plot for the given c values. Do the ridge regression coefficients exhibit
substantial changes near c = 0?

c. Suggest a reasonable value for the biasing constant c based on the ridge trace, the VIF
values, and $R^2$.

d. Transform the estimated standardized regression coefficients selected in part (c) back to
the original variables and obtain the fitted values for the 46 cases. How similar are these
fitted values to those obtained with the ordinary least squares fit in Problem 6.15c?

11.23. Cement composition. Data on the effect of composition of cement on heat evolved during
hardening are given below. The variables collected were the amount of tricalcium aluminate
($X_1$), the amount of tricalcium silicate ($X_2$), the amount of tetracalcium alumino ferrite
($X_3$), the amount of dicalcium silicate ($X_4$), and the heat evolved in calories per gram of
cement (Y).


   i:       1      2      3  ...     11     12     13
X_i1:       7      1     11  ...      1     11     10
X_i2:      26     29     56  ...     40     66     68
X_i3:       6     15      8  ...     23      9      8
X_i4:      60     52     20  ...     34     12     12
 Y_i:    78.5   74.3  104.3  ...   83.8  113.3  109.4


Adapted from H. Woods, H. H. Steinour, and H. R. Starke, "Effect of Composition of Portland Cement on Heat
Evolved During Hardening," Industrial and Engineering Chemistry, 24, 1932, 1207-1214.


a. Fit regression model (6.5) for four predictor variables to the data. State the estimated
regression function.

b. Obtain the estimated ridge standardized regression coefficients, variance inflation factors,
and $R^2$ for the following biasing constants: c = .000, .002, .004, .006, .008, .02, .04, .06,
.08, .10.

c. Make a ridge trace plot for the biasing constants listed in part (b). Do the ridge regression
coefficients exhibit substantial changes near c = 0?

d. Suggest a reasonable value for the biasing constant c based on the ridge trace, VIF values,
and $R^2$ values.

e. Transform the estimated standardized ridge regression coefficients selected in part (d) to
the original variables and obtain the fitted values for the 13 cases. How similar are these
fitted values to those obtained with the ordinary least squares fit in part (a)?

11.24. Refer to Commercial properties Problem 6.18.

a. Use least absolute residuals regression to obtain estimates of the parameters $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$,
and $\beta_4$.

b. Find the sum of the absolute residuals based on the LAR fit in part (a). 









c. For the least squares estimated regression function in Problem 6.18c, find the sum of
the absolute residuals. Is this sum larger than the sum obtained in part (b)? Is this to be
expected?

11.25. Crop yield. An agronomist studied the effects of moisture ($X_1$, in inches) and temperature
($X_2$, in °C) on the yield of a new hybrid tomato (Y). The experimental data follow.


   i:      1     2     3  ...    23    24    25
X_i1:      6     6     6  ...    14    14    14
X_i2:     20    21    22  ...    22    23    24
 Y_i:   49.2  48.1  48.0  ...  42.1  43.9  40.5


The agronomist expects that second-order polynomial regression model (8.7) with independent
normal error terms is appropriate here.

a. Fit a second-order polynomial regression model omitting the interaction term and the
quadratic effect term for temperature.

b. Construct a contour plot of the fitted surface obtained in part (a).

c. Use the lowess method to obtain a nonparametric estimate of the yield response surface
as a function of moisture and temperature. Employ weight function (11.53), q = 9/25,
and a Euclidean distance measure with unscaled variables. Obtain fitted values $\hat{Y}_h$ for the
9 × 9 rectangular grid of $(X_{h1}, X_{h2})$ values where $X_{h1} = 6, 7, \ldots, 13, 14$ and
$X_{h2} = 20, 20.5, \ldots, 23.5, 24$, using a local first-order model.

d. Construct a contour plot of the resulting lowess surface. Are the lowess contours consistent
with the contours in part (b) for the polynomial model? Discuss.

11.26. Refer to Computer-assisted learning Problem 11.6.

a. Based on the weighted least squares fit in Problem 11.6e, construct an approximate 95 percent
confidence interval for $\beta_1$ by means of (6.50), using the estimated standard deviation
$s\{b_{w1}\}$.

b. Using random X sampling, obtain 750 bootstrap samples of size 12. For each bootstrap
sample, (1) use ordinary least squares to regress Y on X and obtain the residuals, (2)
estimate the standard deviation function by regressing the absolute residuals on X and
then use the fitted standard deviation function and (11.16a) to obtain weights, and (3) use
weighted least squares to regress Y on X and obtain the bootstrap estimated regression
coefficient $b_1^*$. (Note that for each bootstrap sample, only one iteration of the iteratively
reweighted least squares procedure is to be used.)

c. Construct a histogram of the 750 bootstrap estimates $b_1^*$. Does the bootstrap sampling
distribution of $b_1^*$ appear to approximate a normal distribution?

d. Calculate the sample standard deviation of the 750 bootstrap estimates $b_1^*$. How does this
value compare to the estimated standard deviation $s\{b_{w1}\}$ used in part (a)?

e. Construct a 95 percent bootstrap confidence interval for $\beta_1$ using reflection method (11.59).
How does this confidence interval compare with that obtained in part (a)? Does the
approximate interval in part (a) appear to be useful for this data set?

11.27. Refer to Machine speed Problem 11.7.

a. On the basis of the weighted least squares fit in Problem 11.7e, construct an approximate
90 percent confidence interval for $\beta_1$ by means of (6.50), using the estimated standard
deviation $s\{b_{w1}\}$.

b. Using random X sampling, obtain 800 bootstrap samples of size 12. For each bootstrap
sample, (1) use ordinary least squares to regress Y on X and obtain the residuals, (2) estimate
the standard deviation function by regressing the absolute residuals on X and then use the
fitted standard deviation function and (11.16a) to obtain weights, and (3) use weighted
least squares to regress Y on X and obtain the bootstrap estimated regression coefficient $b_1^*$.
(Note that for each bootstrap sample, only one iteration of the iteratively reweighted least
squares procedure is to be used.)

c. Construct a histogram of the 800 bootstrap estimates $b_1^*$. Does the bootstrap sampling
distribution of $b_1^*$ appear to approximate a normal distribution?

d. Calculate the sample standard deviation of the 800 bootstrap estimates $b_1^*$. How does this
value compare to the estimated standard deviation $s\{b_{w1}\}$ used in part (a)?

e. Construct a 90 percent bootstrap confidence interval for $\beta_1$ using reflection method (11.59).
How does this confidence interval compare with that obtained in part (a)? Does the
approximate interval in part (a) appear to be useful for this data set?

11.28. Mileage study. The effectiveness of a new experimental overdrive gear in reducing gasoline
consumption was studied in 12 trials with a light truck equipped with this gear. In the data
that follow, $X_i$ denotes the constant speed (in miles per hour) on the test track in the ith trial
and $Y_i$ denotes miles per gallon obtained.


  i:    1   2   3   4   5   6   7   8   9  10  11  12
X_i:   35  35  40  40  45  45  50  50  55  55  60  60
Y_i:   22  20  28  31  37  38  41  39  34  37  27  30

Second-order regression model (8.2) with independent normal error terms is expected to be 

appropriate. 

a. Fit regression model (8.2). Plot the fitted regression function and the data. Does the 
quadratic regression function appear to be a good fit here? 

b. Automotive engineers would like to estimate the speed $X_{max}$ at which the average mileage
$E\{Y\}$ is maximized. It can be shown for second-order model (8.2) that
$X_{max} = \bar{X} - (.5\beta_1/\beta_{11})$, provided that $\beta_{11}$ is negative. Estimate the speed at which the average
mileage is maximized, using $\hat{X}_{max} = \bar{X} - (.5b_1/b_{11})$. What is the estimated mean mileage
at the estimated optimum speed?

c. Using fixed X sampling, obtain 1,000 bootstrap samples of size 12. For each bootstrap
sample, fit regression model (8.2) and obtain the bootstrap estimate $\hat{X}^*_{max}$.

d. Construct a histogram of the 1,000 bootstrap estimates $\hat{X}^*_{max}$. Does the bootstrap sampling
distribution of $\hat{X}^*_{max}$ appear to approximate a normal distribution?

e. Construct a 90 percent bootstrap confidence interval for $X_{max}$ using reflection method
(11.56). How precisely has $X_{max}$ been estimated?

11.29. Refer to Muscle mass Problem 1.27.

a. Fit a two-region regression tree. What is the first split point based on age? What is SSE for
this two-region tree?

b. Find the second split point given the two-region tree in part (a). What is SSE for the resulting
three-region tree?

c. Find the third split point given the three-region tree in part (b). What is SSE for the resulting
four-region tree?

d. Prepare a scatter plot of the data with the four-region tree in part (c) superimposed. How
well does the tree fit the data? What does the tree suggest about the change in muscle mass
with age?

e. Prepare a residual plot of $e_i$ versus $\hat{Y}_i$ for the four-region tree in part (d). State your findings.





11.30. Refer to Patient satisfaction Problem 6.15. Consider only the first two predictors (patient's
age, $X_1$, and severity of illness, $X_2$).

a. Fit a two-region regression tree. What is the first split point, and on which predictor is it
based? What is SSE for the resulting two-region tree?

b. Find the second split point given the two-region tree in part (a). Is it based on $X_1$ or $X_2$?
What is SSE for the resulting three-region tree?

c. Find the third split point given the three-region tree in part (b). Is it based on $X_1$ or $X_2$?
What is SSE for the resulting four-region tree?

d. Find the fourth split point given the four-region tree in part (c). Is it based on $X_1$ or $X_2$?
What is SSE for the resulting five-region tree?

e. Prepare a three-dimensional surface plot of the five-region tree obtained in part (d). What
does this tree suggest about the relative importance of the two predictors?

f. Prepare a residual plot of $e_i$ versus $\hat{Y}_i$ for the five-region tree in part (d). State your findings.


Case Studies


11.31. Refer to the Prostate cancer data set in Appendix C.5 and Case Study 9.30. Select a random
sample of 65 observations to use as the model-building data set.

a. Develop a regression tree for predicting PSA. Justify your choice of number of regions 
(tree size), and interpret your regression tree. 

b. Assess your model's ability to predict and discuss its usefulness to the oncologists. 

c. Compare the performance of your regression tree model with that of the best regression
model obtained in Case Study 9.30. Which model is more easily interpreted and why?

11.32. Refer to the Real estate sales data set in Appendix C.7 and Case Study 9.31. Select a random
sample of 300 observations to use as the model-building data set.

a. Develop a regression tree for predicting sales price. Justify your choice of number of 
regions (tree size), and interpret your model. 

b. Assess your model's ability to predict and discuss its usefulness as a tool for predicting 
sales prices. 

c. Compare the performance of your regression tree model with that of the best regression
model obtained in Case Study 9.31. Which model is more easily interpreted and why?



Chapter 12


Autocorrelation in Time
Series Data


The basic regression models considered so far have assumed that the random error terms
$\varepsilon_i$ are either uncorrelated random variables or independent normal random variables. In
business and economics, many regression applications involve time series data. For such
data, the assumption of uncorrelated or independent error terms is often not appropriate;
rather, the error terms are frequently correlated positively over time. Error terms correlated
over time are said to be autocorrelated or serially correlated.

A major cause of positively autocorrelated error terms in business and economic regression
applications involving time series data is the omission of one or several key variables
from the model. When time-ordered effects of such "missing" key variables are positively
correlated, the error terms in the regression model will tend to be positively autocorrelated
since the error terms include effects of missing variables. Consider, for example, the regression
of annual sales of a product against average yearly price of the product over a period
of 30 years. If population size has an important effect on sales, its omission from the model
may lead to the error terms being positively autocorrelated because the effect of population
size on sales likely is positively correlated over time.

Another cause of positively autocorrelated error terms in economic data is the presence 
of systematic coverage errors in the response variable time series, which errors often tend 
to be positively correlated over time. 


12.1 Problems of Autocorrelation 


When the error terms in the regression model are positively autocorrelated, the use of
ordinary least squares procedures has a number of important consequences. We summarize
these first, and then discuss them in more detail:

1. The estimated regression coefficients are still unbiased, but they no longer have the
minimum variance property and may be quite inefficient.

2. MSE may seriously underestimate the variance of the error terms.

3. $s\{b_k\}$ calculated according to ordinary least squares procedures may seriously
underestimate the true standard deviation of the estimated regression coefficient.




4. Confidence intervals and tests using the t and F distributions, discussed earlier, are no
longer strictly applicable.

To illustrate these problems intuitively, we consider the simple linear regression model
with time series data:

$$Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t$$

Here, $Y_t$ and $X_t$ are observations for period t. Let us assume that the error terms $\varepsilon_t$ are
positively autocorrelated as follows:

$$\varepsilon_t = \varepsilon_{t-1} + u_t$$

The $u_t$, called disturbances, are independent normal random variables. Thus, any error term
$\varepsilon_t$ is the sum of the previous error term $\varepsilon_{t-1}$ and a new disturbance term $u_t$. We shall assume
here that the $u_t$ have mean 0 and variance 1.

In Table 12.1, column 1, we show 10 random observations on the normal variable $u_t$
with mean 0 and variance 1, obtained from a standard normal random numbers generator.
Suppose now that $\varepsilon_0 = 3.0$; we obtain then:

$$\varepsilon_1 = \varepsilon_0 + u_1 = 3.0 + .5 = 3.5$$
$$\varepsilon_2 = \varepsilon_1 + u_2 = 3.5 - .7 = 2.8$$
etc.

The error terms $\varepsilon_t$ are shown in Table 12.1, column 2, and they are plotted in Figure 12.1.
Note the systematic pattern in these error terms. Their positive relation over time is shown
by the fact that adjacent error terms tend to be of the same sign and magnitude.


TABLE 12.1 Example of Positively Autocorrelated Error Terms.

         (1)            (2)                     (3)
  t      u_t    ε_{t-1} + u_t = ε_t    Y_t = 2 + .5X_t + ε_t
  0      --                      3.0            5.0
  1      .5      3.0 +  .5  =    3.5            6.0
  2     -.7      3.5 -  .7  =    2.8            5.8
  3      .3      2.8 +  .3  =    3.1            6.6
  4       0      3.1 +   0  =    3.1            7.1
  5    -2.3      3.1 - 2.3  =     .8            5.3
  6    -1.9       .8 - 1.9  =   -1.1            3.9
  7      .2     -1.1 +  .2  =    -.9            4.6
  8     -.3      -.9 -  .3  =   -1.2            4.8
  9      .2     -1.2 +  .2  =   -1.0            5.5
 10     -.1     -1.0 -  .1  =   -1.1            5.9


FIGURE 12.1 Example of Positively Autocorrelated Error Terms.


Suppose that $X_t$ in the regression model represents time, such that $X_1 = 1$, $X_2 = 2$,
etc. Further, suppose we know that $\beta_0 = 2$ and $\beta_1 = .5$ so that the true regression function
is $E\{Y\} = 2 + .5X$. The observed Y values based on the error terms in column 2
of Table 12.1 are shown in column 3. For example, $Y_0 = 2 + .5(0) + 3.0 = 5.0$, and
$Y_1 = 2 + .5(1) + 3.5 = 6.0$. Figure 12.2a contains the true regression line
$E\{Y\} = 2 + .5X$ and the observed Y values shown in Table 12.1, column 3. Figure 12.2b
contains the estimated regression line, fitted by ordinary least squares methods, and repeats
the observed Y values. Notice that the fitted regression line differs sharply from the true
regression line because the initial $\varepsilon_0$ value was large and the succeeding positively
autocorrelated error terms tended to be large for some time. This persistency pattern in the positively
autocorrelated error terms leads to a fitted regression line far from the true one. Had the
initial $\varepsilon_0$ value been small, say, $\varepsilon_0 = -.2$, and the disturbances different, a sharply different
fitted regression line might have been obtained because of the persistency pattern, as shown
in Figure 12.2c. This variation from sample to sample in the fitted regression lines due to
the positively autocorrelated error terms may be so substantial as to lead to large variances
of the estimated regression coefficients when ordinary least squares methods are used.


FIGURE 12.2 Regression with Positively Autocorrelated Error Terms.
(a) True Regression Line and Observations when $\varepsilon_0 = 3$
(b) Fitted Regression Line and Observations when $\varepsilon_0 = 3$
(c) Fitted Regression Line and Observations with $\varepsilon_0 = -.2$ and Different Disturbances
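The construction of Table 12.1 and the ordinary least squares fit behind Figure 12.2b can be retraced in a few lines. The following is only an illustrative sketch in plain Python; the helper names `build_series` and `ols_line` are hypothetical and do not come from the text. It accumulates the ten disturbances of column 1 into the error terms of column 2, forms the observations of column 3 with $X_t = t$, and then fits a straight line by least squares.

```python
# Disturbances u_1, ..., u_10 taken from Table 12.1, column 1.
u = [0.5, -0.7, 0.3, 0.0, -2.3, -1.9, 0.2, -0.3, 0.2, -0.1]


def build_series(eps0, disturbances, beta0=2.0, beta1=0.5):
    """Return (eps, y) for t = 0, ..., n, with eps_t = eps_{t-1} + u_t and X_t = t."""
    eps = [eps0]
    for u_t in disturbances:
        # Round to suppress floating-point noise in the one-decimal table values.
        eps.append(round(eps[-1] + u_t, 10))
    y = [round(beta0 + beta1 * t + e, 10) for t, e in enumerate(eps)]
    return eps, y


def ols_line(x, y):
    """Ordinary least squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


eps, y = build_series(3.0, u)   # initial error term eps_0 = 3.0, as in the text
x = list(range(len(y)))         # X_t = t for t = 0, 1, ..., 10
b0, b1 = ols_line(x, y)
print("error terms: ", eps)
print("observations:", y)
print("fitted line:  intercept", round(b0, 3), "slope", round(b1, 3))
```

For these eleven observations the fitted slope comes out slightly negative (about -.07) even though the true slope is .5, which is the sharp departure from the true regression line described above.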

Another key problem with applying ordinary least squares methods when the error terms
are positively autocorrelated, as mentioned before, is that MSE may seriously underestimate
the variance of the $\varepsilon_t$. Figure 12.2 makes this clear. Note that the variability of the Y
values around the fitted regression line in Figure 12.2b is substantially smaller than the
variability of the Y values around the true regression line in Figure 12.2a. This is one of
the factors leading to an indication of greater precision of the regression coefficients than is
actually the case when ordinary least squares methods are used in the presence of positively
autocorrelated errors.
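A small Monte Carlo experiment can make this underestimation concrete. The sketch below is not from the text and rests on stated assumptions: it uses stationary first-order autoregressive errors $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$ with $\rho = .9$ and disturbance variance 1 (rather than the scheme of Table 12.1, whose error variance is not constant), the design $X_t = t$ for $t = 1, \ldots, 20$, and the true line $E\{Y\} = 2 + .5X$. The function name `simulate_mean_mse` and the seed are hypothetical choices.

```python
import random


def simulate_mean_mse(rho=0.9, sigma=1.0, n=20, reps=2000, seed=12345):
    """Average the OLS MSE over many samples with stationary AR(1) errors.

    Returns (mean_mse, true_error_variance).
    """
    rng = random.Random(seed)
    true_var = sigma ** 2 / (1.0 - rho ** 2)   # variance of each error term
    x = list(range(1, n + 1))                  # assumed design: X_t = t
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    total_mse = 0.0
    for _ in range(reps):
        # Start from the stationary distribution so every error term has variance true_var.
        eps = rng.gauss(0.0, true_var ** 0.5)
        y = []
        for xi in x:
            eps = rho * eps + rng.gauss(0.0, sigma)
            y.append(2.0 + 0.5 * xi + eps)
        ybar = sum(y) / n
        b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
        b0 = ybar - b1 * xbar
        sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
        total_mse += sse / (n - 2)
    return total_mse / reps, true_var


mean_mse, true_var = simulate_mean_mse()
print(f"average MSE = {mean_mse:.3f}, true error variance = {true_var:.3f}")
```

Under these assumptions the average MSE falls well below the true error variance, because the fitted line absorbs much of the slowly drifting error pattern.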

In view of the seriousness of the problems created by autocorrelated errors, it is important
that their presence be detected. A plot of residuals against time is an effective, though
subjective, means of detecting autocorrelated errors. Formal statistical tests have also been
developed. A widely used test is based on the first-order autoregressive error model, which
we take up next. This model is a simple one, yet experience suggests that it is frequently
applicable in business and economics when the error terms are serially correlated.


12.2 First-Order Autoregressive Error Model


Simple Linear Regression 

The generalized simple linear regression model for one predictor variable when the random
error terms follow a first-order autoregressive, or AR(1), process is:

$$\begin{aligned} Y_t &= \beta_0 + \beta_1 X_t + \varepsilon_t \\ \varepsilon_t &= \rho\varepsilon_{t-1} + u_t \end{aligned} \tag{12.1}$$

where:

$\rho$ is a parameter such that $|\rho| < 1$
$u_t$ are independent $N(0, \sigma^2)$

Note that generalized regression model (12.1) is identical to the simple linear regression
model (2.1) except for the structure of the error terms. Each error term in model (12.1)
consists of a fraction of the previous error term (when $\rho > 0$) plus a new disturbance
term $u_t$. The parameter $\rho$ is called the autocorrelation parameter.

Multiple Regression 

The generalized multiple regression model when the random error terms follow a first-order
autoregressive process is:

$$\begin{aligned} Y_t &= \beta_0 + \beta_1 X_{t1} + \beta_2 X_{t2} + \cdots + \beta_{p-1} X_{t,p-1} + \varepsilon_t \\ \varepsilon_t &= \rho\varepsilon_{t-1} + u_t \end{aligned} \tag{12.2}$$

where:

$|\rho| < 1$
$u_t$ are independent $N(0, \sigma^2)$

Thus, we see that generalized multiple regression model (12.2) is identical to the earlier
multiple regression model (6.7) except for the structure of the error terms.


Properties of Error Terms

Regression models (12.1) and (12.2) are generalized regression models because the error
terms $\varepsilon_t$ in these models are correlated. However, the error terms still have mean zero and
constant variance:

$$E\{\varepsilon_t\} = 0 \tag{12.3}$$

$$\sigma^2\{\varepsilon_t\} = \frac{\sigma^2}{1-\rho^2} \tag{12.4}$$

Note that the variance of the error terms here is a function of the autocorrelation parameter $\rho$.

The covariance between adjacent error terms $\varepsilon_t$ and $\varepsilon_{t-1}$ is:

$$\sigma\{\varepsilon_t, \varepsilon_{t-1}\} = \rho\left(\frac{\sigma^2}{1-\rho^2}\right) \tag{12.5}$$

The coefficient of correlation between $\varepsilon_t$ and $\varepsilon_{t-1}$, denoted by $\rho\{\varepsilon_t, \varepsilon_{t-1}\}$, is defined as
follows:

$$\rho\{\varepsilon_t, \varepsilon_{t-1}\} = \frac{\sigma\{\varepsilon_t, \varepsilon_{t-1}\}}{\sigma\{\varepsilon_t\}\,\sigma\{\varepsilon_{t-1}\}} \tag{12.6}$$

Since the variance of each error term according to (12.4) is $\sigma^2/(1-\rho^2)$, the coefficient of
correlation using (12.5) is:

$$\rho\{\varepsilon_t, \varepsilon_{t-1}\} = \frac{\rho\left(\dfrac{\sigma^2}{1-\rho^2}\right)}{\sqrt{\dfrac{\sigma^2}{1-\rho^2}}\,\sqrt{\dfrac{\sigma^2}{1-\rho^2}}} = \rho \tag{12.6a}$$

Thus, the autocorrelation parameter $\rho$ is the coefficient of correlation between adjacent
error terms.

The covariance between error terms that are $s$ periods apart can be shown to be:

$$\sigma\{\varepsilon_t, \varepsilon_{t-s}\} = \rho^s\left(\frac{\sigma^2}{1-\rho^2}\right) \qquad s \neq 0 \tag{12.7}$$

and is called the autocovariance function. The coefficient of correlation between $\varepsilon_t$ and $\varepsilon_{t-s}$
therefore is:

$$\rho\{\varepsilon_t, \varepsilon_{t-s}\} = \rho^s \qquad s \neq 0 \tag{12.8}$$

Note that (12.8) is called the autocorrelation function. Thus, when $\rho$ is positive, all error
terms are correlated, but the further apart they are, the less is the correlation between them.
The only time the error terms for the autoregressive error models (12.1) and (12.2) are
uncorrelated is when $\rho = 0$.



486 Part Two Multiple Linear Regression


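From the results for the variances and covariances of the error terms in (12.4) and (12.7),
we can now state the variance-covariance matrix of the error terms for the first-order
autoregressive generalized regression models (12.1) and (12.2):

$$\sigma^2\{\boldsymbol{\varepsilon}\} = \begin{bmatrix} \kappa & \kappa\rho & \kappa\rho^2 & \cdots & \kappa\rho^{n-1} \\ \kappa\rho & \kappa & \kappa\rho & \cdots & \kappa\rho^{n-2} \\ \vdots & \vdots & \vdots & & \vdots \\ \kappa\rho^{n-1} & \kappa\rho^{n-2} & \kappa\rho^{n-3} & \cdots & \kappa \end{bmatrix} \tag{12.9}$$

where:

$$\kappa = \frac{\sigma^2}{1-\rho^2} \tag{12.9a}$$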


Note again that the variance-covariance matrix (12.9) reflects the generalized nature of
regression models (12.1) and (12.2) by containing nonzero covariance terms.
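As a small illustration of (12.9), the sketch below builds the matrix entry by entry from (12.4) and (12.7). The function name and the list-of-lists representation are assumptions made for the example.

```python
def ar1_covariance_matrix(n, rho, sigma2):
    """Variance-covariance matrix (12.9) for n AR(1) error terms.

    Entry (i, j) is kappa * rho**|i - j| with kappa = sigma2 / (1 - rho**2),
    which gives the variance (12.4) on the diagonal and the autocovariance (12.7) elsewhere.
    """
    if not abs(rho) < 1:
        raise ValueError("the AR(1) error model requires |rho| < 1")
    kappa = sigma2 / (1.0 - rho ** 2)
    return [[kappa * rho ** abs(i - j) for j in range(n)] for i in range(n)]
```

For instance, with `rho = 0.5` and `sigma2 = 3.0`, the common variance is 4.0 and the covariances fall off as 2.0, 1.0, and so on with increasing separation.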


Comments

1. It is instructive to expand the definition of the first-order autoregressive error term $\varepsilon_t$:

$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$

Since this definition holds for all $t$, we have $\varepsilon_{t-1} = \rho\varepsilon_{t-2} + u_{t-1}$. When we substitute this expression
above, we obtain:

$$\varepsilon_t = \rho(\rho\varepsilon_{t-2} + u_{t-1}) + u_t = \rho^2\varepsilon_{t-2} + \rho u_{t-1} + u_t$$

Replacing now $\varepsilon_{t-2}$ by $\rho\varepsilon_{t-3} + u_{t-2}$, we obtain:

$$\varepsilon_t = \rho^3\varepsilon_{t-3} + \rho^2 u_{t-2} + \rho u_{t-1} + u_t$$

Continuing in this fashion, we find:

$$\varepsilon_t = \sum_{s=0}^{\infty} \rho^s u_{t-s} \tag{12.10}$$

Thus, the error term $\varepsilon_t$ in period $t$ is a linear combination of the current and preceding disturbance
terms. When $0 < \rho < 1$, (12.10) indicates that the further the period $t - s$ is in the past, the smaller
is the weight of disturbance term $u_{t-s}$ in determining $\varepsilon_t$.

2. The derivation of (12.3), that the error terms have expectation zero, follows directly from taking
the expectation of $\varepsilon_t$ in (12.10) and using the fact that $E\{u_t\} = 0$ for all $t$ according to models (12.1)
and (12.2).

3. To derive the variance of the error terms in (12.4), we utilize the assumption of models (12.1)
and (12.2) that the $u_t$ are independent with variance $\sigma^2$. It then follows from (12.10) that:

$$\sigma^2\{\varepsilon_t\} = \sum_{s=0}^{\infty} \rho^{2s}\sigma^2\{u_{t-s}\} = \sigma^2\sum_{s=0}^{\infty} \rho^{2s}$$

Now for $|\rho| < 1$, it is known that:

$$\sum_{s=0}^{\infty} \rho^{2s} = \frac{1}{1-\rho^2}$$

Hence, we have:

$$\sigma^2\{\varepsilon_t\} = \frac{\sigma^2}{1-\rho^2}$$

4. To derive the covariance of $\varepsilon_t$ and $\varepsilon_{t-1}$ in (12.5), we need to recognize that:

$$\sigma^2\{\varepsilon_t\} = E\{\varepsilon_t^2\}$$

$$\sigma\{\varepsilon_t, \varepsilon_{t-1}\} = E\{\varepsilon_t\varepsilon_{t-1}\}$$

These results follow from (A.15a) and (A.21a), respectively, since $E\{\varepsilon_t\} = 0$ by (12.3) for all $t$.

By (12.10), we have:

$$E\{\varepsilon_t\varepsilon_{t-1}\} = E\{(u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \cdots)(u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots)\}$$

which can be rewritten:

$$E\{\varepsilon_t\varepsilon_{t-1}\} = E\{[u_t + \rho(u_{t-1} + \rho u_{t-2} + \cdots)][u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots]\}$$

$$= E\{u_t(u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots)\} + E\{\rho(u_{t-1} + \rho u_{t-2} + \rho^2 u_{t-3} + \cdots)^2\}$$

Since $E\{u_t u_{t-s}\} = 0$ for all $s \neq 0$ by the assumed independence of the $u_t$ and the fact that $E\{u_t\} = 0$
for all $t$, the first term drops out and we obtain:

$$E\{\varepsilon_t\varepsilon_{t-1}\} = \rho E\{\varepsilon_{t-1}^2\} = \rho\sigma^2\{\varepsilon_{t-1}\}$$

Hence, by (12.4), which holds for all $t$, we have:

$$\sigma\{\varepsilon_t, \varepsilon_{t-1}\} = \rho\left(\frac{\sigma^2}{1-\rho^2}\right)$$

5. The first-order autoregressive error process in models (12.1) and (12.2) is the simplest kind. A
second-order process would be:

$$\varepsilon_t = \rho_1\varepsilon_{t-1} + \rho_2\varepsilon_{t-2} + u_t \tag{12.11}$$

Still higher-order processes could be postulated. Specialized approaches have been developed for
complex autoregressive error processes. These are discussed in treatments of time series procedures
and forecasting, such as in Reference 12.1.

12.3 Durbin-Watson Test for Autocorrelation

The Durbin-Watson test for autocorrelation assumes the first-order autoregressive error
models (12.1) or (12.2), with the values of the predictor variable(s) fixed. The test consists
of determining whether or not the autocorrelation parameter $\rho$ in (12.1) or (12.2) is zero.
Note that if $\rho = 0$, then $\varepsilon_t = u_t$. Hence, the error terms $\varepsilon_t$ are independent when $\rho = 0$
since the disturbance terms $u_t$ are independent.

Because correlated error terms in business and economic applications tend to show
positive serial correlation, the usual test alternatives considered are:

$$H_0\colon \rho = 0$$
$$H_a\colon \rho > 0 \tag{12.12}$$



The Durbin-Watson test statistic $D$ is obtained by using ordinary least squares to fit the
regression function, calculating the ordinary residuals:

$$e_t = Y_t - \hat{Y}_t \tag{12.13}$$

and then calculating the statistic:

$$D = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \tag{12.14}$$

where $n$ is the number of cases.

Exact critical values are difficult to obtain, but Durbin and Watson have obtained lower
and upper bounds $d_L$ and $d_U$ such that a value of $D$ outside these bounds leads to a definite
decision. The decision rule for testing between the alternatives in (12.12) is:

If $D > d_U$, conclude $H_0$
If $D < d_L$, conclude $H_a$
If $d_L \le D \le d_U$, the test is inconclusive (12.15)

Small values of $D$ lead to the conclusion that $\rho > 0$ because the adjacent error terms $\varepsilon_t$
and $\varepsilon_{t-1}$ tend to be of the same magnitude when they are positively autocorrelated. Hence,
the differences in the residuals, $e_t - e_{t-1}$, would tend to be small when $\rho > 0$, leading to a
small numerator in $D$ and hence to a small test statistic $D$.

Table B.7 contains the bounds $d_L$ and $d_U$ for various sample sizes ($n$), for two levels of
significance (.05 and .01), and for various numbers of $X$ variables ($p - 1$) in the regression
model.


Example

The Blaisdell Company wished to predict its sales by using industry sales as a predictor
variable. (Accurate predictions of industry sales are available from the industry's trade
association.) A portion of the seasonally adjusted quarterly data on company sales and
industry sales for the period 1998-2002 is shown in Table 12.2, columns 1 and 2. A scatter
plot (not shown) suggested that a linear regression model is appropriate. The market research
analyst was, however, concerned whether or not the error terms are positively autocorrelated.

The results of using ordinary least squares to fit a regression line to the data in Table 12.2
are shown at the bottom of Table 12.2. The residuals $e_t$ are shown in column 3 of Table 12.2
and are plotted against time in Figure 12.3. Note how the residuals consistently are above
or below the zero line for extended periods. Positive autocorrelation in the error terms is
suggested by such a pattern when an appropriate regression function has been employed.

FIGURE 12.3 Residuals Plotted against Time, Blaisdell Company Example.

The analyst wished to confirm this graphic diagnosis by using the Durbin-Watson test
for the alternatives:

$$H_0\colon \rho = 0$$
$$H_a\colon \rho > 0$$

Columns 4, 5, and 6 of Table 12.2 contain the necessary calculations for the test statistic
$D$. The analyst then obtained:

$$D = \frac{\sum_{t=2}^{20}(e_t - e_{t-1})^2}{\sum_{t=1}^{20} e_t^2} = \frac{.09794}{.13330} = .735$$

For level of significance of .01, we find in Table B.7 for $n = 20$ and $p - 1 = 1$:

$$d_L = .95 \qquad d_U = 1.15$$

Since $D = .735$ falls below $d_L = .95$, decision rule (12.15) indicates that the appropriate
conclusion is $H_a$, namely, that the error terms are positively autocorrelated.
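The statistic (12.14) and decision rule (12.15) used in this example can be sketched in a few lines. The function names and returned strings are illustrative assumptions; the bounds $d_L$ and $d_U$ must still be looked up in a table such as Table B.7 and passed in by the user.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic (12.14) from residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision_positive(d, d_lower, d_upper):
    """Decision rule (12.15) for H0: rho = 0 versus Ha: rho > 0."""
    if d > d_upper:
        return "conclude H0"
    if d < d_lower:
        return "conclude Ha"
    return "inconclusive"
```

Calling `dw_decision_positive(0.735, 0.95, 1.15)` with the values from this example returns `"conclude Ha"`, matching the conclusion reached above.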

Comments

1. If a test for negative autocorrelation is required, the test statistic to be used is $4 - D$, where $D$
is defined as above. The test is then conducted in the same manner described for testing for positive
autocorrelation. That is, if the quantity $4 - D$ falls below $d_L$, we conclude $\rho < 0$, that negative
autocorrelation exists, and so on.


TABLE 12.2 Data, Regression Results, and Durbin-Watson Test Calculations, Blaisdell Company Example
(Company and Industry Sales Data Are Seasonally Adjusted).

             (1)            (2)           (3)          (4)              (5)             (6)
           Company        Industry
            Sales          Sales        Residual
Quarter  ($ millions)   ($ millions)
   t         Y_t            X_t           e_t      e_t - e_{t-1}  (e_t - e_{t-1})^2    e_t^2

   1        20.96          127.3        -.026052        --              --           .0006787
   2        21.40          130.0        -.062015     -.035963        .0012933        .0038459
   3        21.96          132.7         .022021      .084036        .0070620        .0004849
   4        21.52          129.4         .163754      .141733        .0200882        .0268154
  ...        ...            ...            ...          ...             ...             ...
  17        27.52          164.2         .029112     -.076990        .0059275        .0008475
  18        27.78          165.6         .042316      .013204        .0001743        .0017906
  19        28.24          168.7        -.044160     -.086476        .0074781        .0019501
  20        28.78          171.7        -.033009      .011151        .0001243        .0010896
 Total                                                                .0979400        .1333018

$\hat{Y} = -1.4548 + .17628X$
$s\{b_0\} = .21415$    $s\{b_1\} = .00144$
$MSE = .00741$






2. A two-sided test for $H_0\colon \rho = 0$ versus $H_a\colon \rho \neq 0$ can be made by employing both one-sided
tests separately. The Type I risk with the two-sided test is $2\alpha$, where $\alpha$ is the Type I risk for each
one-sided test.

3. When the Durbin-Watson test employing the bounds $d_L$ and $d_U$ gives indeterminate results, in
principle more cases are required. Of course, with time series data it may be impossible to obtain
more cases, or additional cases may lie in the future and be obtainable only with great delay. Durbin
and Watson (Ref. 12.2) do give an approximate test which may be used when the bounds test is
indeterminate, but the degrees of freedom should be larger than about 40 before this approximate test
will give more than a rough indication of whether autocorrelation exists.

A reasonable procedure is to treat indeterminate results as suggesting the presence of autocorrelated
errors and employ one of the remedial actions to be discussed next. When remedial action does not
lead to regression results substantially different from those of ordinary least squares, the assumption
of uncorrelated error terms would appear to be satisfactory. When the remedial action does lead to
substantially different regression results (such as larger estimated standard errors for the regression
coefficients or the elimination of autocorrelated errors), the results obtained by means of the remedial
action are probably the more useful ones.

4. The Durbin-Watson test is not robust against misspecifications of the model. For example, the
Durbin-Watson test may not disclose the presence of autocorrelated errors that follow the second-order
autoregressive pattern in (12.11).

5. The Durbin-Watson test is widely used; however, other tests for autocorrelation are available.
One such test, due to Theil and Nagar, is found in Reference 12.3.
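The negative and two-sided variants described in Comments 1 and 2 can be sketched as follows. The function name and the returned labels are assumptions for illustration; the bounds passed in are the one-sided bounds at level $\alpha$, so the two-sided Type I risk is $2\alpha$.

```python
def dw_decision_two_sided(d, d_lower, d_upper):
    """Two-sided Durbin-Watson decision built from the two one-sided tests.

    The test for positive autocorrelation uses D; the test for negative
    autocorrelation applies the same bounds to 4 - D.
    """
    d_negative = 4.0 - d
    if d < d_lower:
        return "positive autocorrelation"
    if d_negative < d_lower:
        return "negative autocorrelation"
    if d > d_upper and d_negative > d_upper:
        return "no autocorrelation"
    return "inconclusive"
```

With this sketch, a statistic near 2 clears both one-sided tests, while a statistic near 0 or near 4 signals positive or negative autocorrelation, respectively.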


12.4 Remedial Measures for Autocorrelation 


The two principal remedial measures when autocorrelated error terms are present are to add
one or more predictor variables to the regression model or to use transformed variables.

Addition of Predictor Variables

As noted earlier, one major cause of autocorrelated error terms is the omission from the
model of one or more key predictor variables that have time-ordered effects on the response
variable. When autocorrelated error terms are found to be present, the first remedial action
should always be to search for missing key predictor variables. In an earlier illustration, we
mentioned population size as a missing variable in a regression of annual sales of a product
on average yearly price of the product during a 30-year period.

When the long-term persistent effects in a response variable cannot be captured by one
or several predictor variables, a trend component can be added to the regression model, such
as a linear trend or an exponential trend. Use of indicator variables for seasonal effects, as
discussed on pages 319-321, can be helpful in eliminating or reducing autocorrelation in
the error terms when the response variable is subject to seasonal effects (e.g., quarterly sales
data).

Use of Transformed Variables

Only when use of additional predictor variables is not helpful in eliminating the problem of
autocorrelated errors should a remedial action based on transformed variables be employed.
A number of remedial procedures that rely on transformations of the variables have been
developed. We shall explain three of these methods. Our explanation will be in terms of
simple linear regression, but the extension to multiple regression is direct.



The three methods to be described are each based on an interesting property of the
first-order autoregressive error term regression model (12.1). Consider the transformed
dependent variable:

$$Y_t' = Y_t - \rho Y_{t-1}$$

Substituting in this expression for $Y_t$ and $Y_{t-1}$ according to regression model (12.1),
we obtain:

$$Y_t' = (\beta_0 + \beta_1 X_t + \varepsilon_t) - \rho(\beta_0 + \beta_1 X_{t-1} + \varepsilon_{t-1})$$
$$= \beta_0(1 - \rho) + \beta_1(X_t - \rho X_{t-1}) + (\varepsilon_t - \rho\varepsilon_{t-1})$$

But, by (12.1), $\varepsilon_t - \rho\varepsilon_{t-1} = u_t$. Hence:

$$Y_t' = \beta_0(1 - \rho) + \beta_1(X_t - \rho X_{t-1}) + u_t \tag{12.16}$$

where the $u_t$ are the independent disturbance terms. Thus, when we use the transformed
variable $Y_t'$, the regression model contains error terms that are independent. Further, model
(12.16) is still a simple linear regression model with new $X$ variable $X_t' = X_t - \rho X_{t-1}$, as
may be seen by rewriting (12.16) as follows:

$$Y_t' = \beta_0' + \beta_1' X_t' + u_t \tag{12.17}$$

where:

$$Y_t' = Y_t - \rho Y_{t-1}$$
$$X_t' = X_t - \rho X_{t-1}$$
$$\beta_0' = \beta_0(1 - \rho)$$
$$\beta_1' = \beta_1$$

Hence, by use of the transformed variables $X_t'$ and $Y_t'$, we obtain a standard simple linear
regression model with independent error terms. This means that ordinary least squares
methods have their usual optimum properties with this model.

In order to be able to use the transformed model (12.17), one generally needs to estimate
the autocorrelation parameter $\rho$ since its value is usually unknown. The three methods to
be described differ in how this is done. Often, however, the results obtained with the three
methods are quite similar.
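The transformation behind (12.16) and (12.17) is the same operation applied to both series, which the sketch below makes explicit. The function name is an assumption; the value of the autocorrelation parameter is supplied by the caller.

```python
def transform_series(values, rho):
    """Return the transformed series v'_t = v_t - rho * v_{t-1} for t = 2, ..., n.

    The first observation has no predecessor, so the result has n - 1 entries.
    """
    return [values[t] - rho * values[t - 1] for t in range(1, len(values))]
```

Applied to a sequence of AR(1) error terms with the true parameter, this transformation returns the underlying disturbances. For example, errors 1.0, 2.5, 0.25 generated with parameter 0.5 from disturbances 2.0 and -1.0 are mapped back to `[2.0, -1.0]`.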

Once an estimate of $\rho$ has been obtained, to be denoted by $r$, transformed variables are
obtained using this estimate of $\rho$:

$$Y_t' = Y_t - rY_{t-1} \tag{12.18a}$$
$$X_t' = X_t - rX_{t-1} \tag{12.18b}$$

Regression model (12.17) is then fitted to these transformed data, yielding an estimated
regression function:

$$\hat{Y}' = b_0' + b_1' X' \tag{12.19}$$

If this fitted regression function has eliminated the autocorrelation in the error terms, we
can transform back to a fitted regression model in the original variables as follows:

$$\hat{Y} = b_0 + b_1 X \tag{12.20}$$

where:

$$b_0 = \frac{b_0'}{1 - r} \tag{12.20a}$$
$$b_1 = b_1' \tag{12.20b}$$

The estimated standard deviations of the regression coefficients for the original variables
can be obtained from those for the regression coefficients for the transformed variables as
follows:

$$s\{b_0\} = \frac{s\{b_0'\}}{1 - r} \tag{12.21a}$$
$$s\{b_1\} = s\{b_1'\} \tag{12.21b}$$

Cochrane-Orcutt Procedure

The Cochrane-Orcutt procedure involves an iteration of three steps.


1. Estimation of $\rho$. This is accomplished by noting that the autoregressive error process
assumed in model (12.1) can be viewed as a regression through the origin:

$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$

where $\varepsilon_t$ is the response variable, $\varepsilon_{t-1}$ the predictor variable, $u_t$ the error term, and $\rho$ the slope
of the line through the origin. Since the $\varepsilon_t$ and $\varepsilon_{t-1}$ are unknown, we use the residuals $e_t$ and
$e_{t-1}$ obtained by ordinary least squares as the response and predictor variables, and estimate
$\rho$ by fitting a straight line through the origin. From our previous discussion of regression
through the origin, we know by (4.14) that the estimate of the slope $\rho$, denoted by $r$, is:

$$r = \frac{\sum_{t=2}^{n} e_{t-1}e_t}{\sum_{t=2}^{n} e_{t-1}^2} \tag{12.22}$$

2. Fitting of transformed model (12.17). Using the estimate $r$ in (12.22), we next obtain
the transformed variables $Y_t'$ and $X_t'$ in (12.18) and use ordinary least squares with these
transformed variables to yield the fitted regression function (12.19).

3. Test for need to iterate. The Durbin-Watson test is then employed to test whether the
error terms for the transformed model are uncorrelated. If the test indicates that they are
uncorrelated, the procedure terminates. The fitted regression model in the original variables
is then obtained by transforming the regression coefficients back according to (12.20).

If the Durbin-Watson test indicates that autocorrelation is still present after the first
iteration, the parameter $\rho$ is reestimated from the new residuals for the fitted regression model
(12.20) with the original variables, which was derived from the fitted regression model
(12.19) with the transformed variables. A new set of transformed variables is then obtained
with the new $r$. This process may be continued for another iteration or two until the
Durbin-Watson test suggests that the error terms in the transformed model are uncorrelated. If
the process does not terminate after one or two iterations, a different procedure should be
employed.
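The three steps above can be sketched as a short loop for the simple linear regression case. This is an illustration under stated assumptions rather than a reference implementation: the function names, the `max_iter` cap, the dictionary return value, and the simulation helper are all hypothetical, and the upper bound `d_upper` must be supplied by the user from a table such as Table B.7. Only the positive-autocorrelation check $D > d_U$ is applied.

```python
import random


def ols(x, y):
    """Least squares fit of a simple linear regression; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def estimate_rho(e):
    """Slope of the regression through the origin of e_t on e_{t-1}, as in (12.22)."""
    return sum(e[t - 1] * e[t] for t in range(1, len(e))) / sum(
        e[t - 1] ** 2 for t in range(1, len(e))
    )


def durbin_watson(e):
    """Durbin-Watson statistic (12.14)."""
    return sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / sum(v ** 2 for v in e)


def cochrane_orcutt(x, y, d_upper, max_iter=3):
    """Iterate steps 1-3; stop when D for the transformed model exceeds d_upper."""
    b0, b1 = ols(x, y)
    r, d, converged = None, None, False
    for _ in range(max_iter):
        # Step 1: estimate rho from residuals of the current fit in the original variables.
        e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        r = estimate_rho(e)
        # Step 2: fit the transformed model (12.17) using (12.18).
        xt = [x[t] - r * x[t - 1] for t in range(1, len(x))]
        yt = [y[t] - r * y[t - 1] for t in range(1, len(y))]
        b0_t, b1_t = ols(xt, yt)
        b0, b1 = b0_t / (1.0 - r), b1_t  # back-transformation (12.20)
        # Step 3: Durbin-Watson statistic for the transformed-model residuals.
        d = durbin_watson([yi - (b0_t + b1_t * xi) for xi, yi in zip(xt, yt)])
        if d > d_upper:
            converged = True
            break
    return {"r": r, "b0": b0, "b1": b1, "D": d, "converged": converged}


def simulate_ar1_regression(n, beta0, beta1, rho, sigma, seed=0):
    """Hypothetical helper: X_t = t and Y_t from model (12.1), for trying the sketch."""
    rng = random.Random(seed)
    x = [float(t) for t in range(1, n + 1)]
    y, eps_prev = [], 0.0
    for x_t in x:
        eps_prev = rho * eps_prev + rng.gauss(0.0, sigma)
        y.append(beta0 + beta1 * x_t + eps_prev)
    return x, y
```

A caller would check the `converged` flag and, following the text, switch to a different procedure if it is still false after one or two iterations.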


Example

For the Blaisdell Company example, the necessary calculations for estimating the
autocorrelation parameter $\rho$, based on the residuals obtained with ordinary least squares applied
to the original variables, are illustrated in Table 12.3. Column 1 repeats the residuals from
Table 12.2. Column 2 contains the residuals $e_{t-1}$, and columns 3 and 4 contain the necessary
calculations. Hence, we estimate:

$$r = \frac{.0834478}{.1322122} = .631166$$

TABLE 12.3 Calculations for Estimating $\rho$ with the Cochrane-Orcutt Procedure, Blaisdell Company Example.

           (1)          (2)            (3)              (4)
   t       e_t        e_{t-1}      e_{t-1} e_t       e_{t-1}^2

   1     -.026052       --             --               --
   2     -.062015     -.026052       .0016156         .0006787
   3      .022021     -.062015      -.0013656         .0038459
   4      .163754      .022021       .0036060         .0004849
  ...       ...          ...            ...              ...
  17      .029112      .106102       .0030889         .0112576
  18      .042316      .029112       .0012319         .0008475
  19     -.044160      .042316      -.0018687         .0017906
  20     -.033009     -.044160       .0014577         .0019501
 Total                               .0834478         .1322122

$r = \sum e_{t-1}e_t \big/ \sum e_{t-1}^2 = .0834478/.1322122 = .631166$

We now obtain the transformed variables $Y_t'$ and $X_t'$ in (12.18):

$$Y_t' = Y_t - .631166Y_{t-1}$$
$$X_t' = X_t - .631166X_{t-1}$$

TABLE 12.4 Transformed Variables and Regression Results for First Iteration with Cochrane-Orcutt Procedure, Blaisdell Company Example.

           (1)       (2)              (3)                        (4)
   t       Y_t       X_t     Y'_t = Y_t - .631166Y_{t-1}  X'_t = X_t - .631166X_{t-1}

   1      20.96     127.3             --                         --
   2      21.40     130.0           8.1708                     49.653
   3      21.96     132.7           8.4530                     50.648
   4      21.52     129.4           7.6596                     45.644
  ...      ...       ...              ...                        ...
  17      27.52     164.2          10.4911                     62.772
  18      27.78     165.6          10.4103                     61.963
  19      28.24     168.7          10.7062                     64.179
  20      28.78     171.7          10.9559                     65.222

$\hat{Y}' = -.3941 + .17376X'$
$s\{b_0'\} = .1672$    $s\{b_1'\} = .002957$
$MSE = .00451$

These are found in Table 12.4. Columns 1 and 2 repeat the original variables $Y_t$ and $X_t$,
and columns 3 and 4 contain the transformed variables $Y_t'$ and $X_t'$. Ordinary least squares
fitting of linear regression is now used with these transformed variables based on the $n - 1$
cases remaining after the transformations. The fitted regression line and other regression
results are shown at the bottom of Table 12.4. The fitted regression line in the transformed
variables is:

$$\hat{Y}' = -.3941 + .17376X' \tag{12.23}$$

where:

$$Y_t' = Y_t - .631166Y_{t-1}$$
$$X_t' = X_t - .631166X_{t-1}$$

Since the random term in the transformed regression model (12.17) is the disturbance
term $u_t$, $MSE = .00451$ is an estimate of the variance of this disturbance term; recall that
$\sigma^2\{u_t\} = \sigma^2$.

From the fitted regression function for the transformed variables in (12.23), residuals
were obtained and the Durbin-Watson statistic calculated. The result was (calculations not
shown) $D = 1.65$. From Table B.7, we find for $\alpha = .01$, $p - 1 = 1$, and $n = 19$:

$$d_L = .93 \qquad d_U = 1.13$$

Since $D = 1.65 > d_U = 1.13$, we conclude that the autocorrelation coefficient for the error
terms in the model with the transformed variables is zero.

Having successfully handled the problem of autocorrelated error terms, we now transform
the fitted model in (12.23) back to the original variables, using (12.20):

$$b_0 = \frac{b_0'}{1 - r} = \frac{-.3941}{1 - .631166} = -1.0685$$
$$b_1 = b_1' = .17376$$

leading to the fitted regression function in the original variables:

$$\hat{Y} = -1.0685 + .17376X \tag{12.24}$$

Finally, we obtain the estimated standard deviations of the regression coefficients for the
original variables by using (12.21). From the results in Table 12.4, we find:

$$s\{b_0\} = \frac{s\{b_0'\}}{1 - r} = \frac{.1672}{1 - .631166} = .45332$$
$$s\{b_1\} = s\{b_1'\} = .002957$$
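The back-transformation arithmetic in (12.20) and (12.21) is easy to check mechanically. The helper below is a hypothetical convenience function, not something defined in the text.

```python
def back_transform(b0_t, b1_t, s_b0_t, s_b1_t, r):
    """Map transformed-model estimates to the original variables via (12.20) and (12.21)."""
    scale = 1.0 - r
    return {
        "b0": b0_t / scale,      # (12.20a)
        "b1": b1_t,              # (12.20b)
        "s_b0": s_b0_t / scale,  # (12.21a)
        "s_b1": s_b1_t,          # (12.21b)
    }
```

Passing the Table 12.4 results, `back_transform(-0.3941, 0.17376, 0.1672, 0.002957, 0.631166)`, reproduces the intercept and standard deviation reported above to the displayed precision.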


Comments

1. The Cochrane-Orcutt approach does not always work properly. A major reason is that when
the error terms are positively autocorrelated, the estimate $r$ in (12.22) tends to underestimate the
autocorrelation parameter $\rho$. When this bias is serious, it can significantly reduce the effectiveness of
the Cochrane-Orcutt approach.

2. There exists an approximate relation between the Durbin-Watson test statistic $D$ in (12.14) and
the estimated autocorrelation parameter $r$ in (12.22):

$$D \approx 2(1 - r) \tag{12.25}$$

This relation indicates that the Durbin-Watson statistic ranges approximately between 0 and 4
since $r$ takes on values between $-1$ and 1, and that $D$ is approximately 2 when $r = 0$. Note that
for the Blaisdell Company example ordinary least squares regression fit, $D = .735$, $r = .631$, and
$2(1 - r) = .738$.

3. Under certain circumstances, it may be helpful to construct pseudotransformed values for period
1 so that the regression for the transformed variables is based on $n$, rather than $n - 1$, cases. Procedures
for doing this are discussed in specialized texts such as Reference 12.4.

4. The least squares properties of the residuals, such as that the sum of the residuals is zero, apply
to the residuals for the fitted regression function with the transformed variables, not to the residuals
for the fitted regression function transformed back to the original variables.

Hildreth-Lu Procedure

The Hildreth-Lu procedure for estimating the autocorrelation parameter $\rho$ for use in the
transformations (12.18) is analogous to the Box-Cox procedure for estimating the
parameter $\lambda$ in the power transformation of $Y$ to improve the appropriateness of the standard
regression model. The value of $\rho$ chosen with the Hildreth-Lu procedure is the one that
minimizes the error sum of squares for the transformed regression model (12.17):

$$SSE = \sum(Y_t' - \hat{Y}_t')^2 = \sum(Y_t' - b_0' - b_1'X_t')^2 \tag{12.26}$$

Computer programs are available to find the value of $\rho$ that minimizes SSE. Alternatively,
one can do a numerical search, running repeated regressions with different values of $\rho$ for
identifying the approximate magnitude of $\rho$ that minimizes SSE. In the region of $\rho$ that
leads to minimum SSE, a finer search can be conducted to obtain a more precise value of $\rho$.

Once the value of $\rho$ that minimizes SSE is found, the fitted regression function
corresponding to that value of $\rho$ is examined to see if the transformation has successfully
eliminated the autocorrelation. If so, the fitted regression function in the original variables
can then be obtained by means of (12.20).

TABLE 12.5 Hildreth-Lu Results, Blaisdell Company Example.

   rho     SSE          rho     SSE

   .10    .1170         .94    .0718
   .30    .0938         .95    .07171
   .50    .0805         .96    .07167
   .70    .0758         .97    .07175
   .90    .0728         .98    .07197
   .92    .0723

For $\rho = .96$: $\hat{Y}' = .07117 + .16045X'$
$s\{b_0'\} = .05798$    $s\{b_1'\} = .006840$
$MSE = .00422$

Example

Table 12.5 contains the regression results for the Hildreth-Lu procedure when fitting the
transformed regression model (12.17) to the Blaisdell Company data for different values
of the autocorrelation parameter $\rho$. Note that SSE is minimized when $\rho$ is near .96, so we
shall let $r = .96$ be the estimate of $\rho$. The fitted regression function for the transformed
variables corresponding to $r = .96$ and other regression results are given at the bottom of
Table 12.5. The fitted regression function in the transformed variables is:

$$\hat{Y}' = .07117 + .16045X' \tag{12.27}$$


where:

$$Y_t' = Y_t - .96Y_{t-1}$$
$$X_t' = X_t - .96X_{t-1}$$

The Durbin-Watson test statistic for this fitted model is $D = 1.73$. Since for $n = 19$,
$p - 1 = 1$, and $\alpha = .01$ the upper critical value is $d_U = 1.13$, we conclude that no
autocorrelation remains in the transformed model.

Therefore, we shall transform regression function (12.27) back to the original variables.
Using (12.20), we obtain:

$$\hat{Y} = 1.7793 + .16045X \tag{12.28}$$

The estimated standard deviations of these regression coefficients are:

$$s\{b_0\} = 1.450 \qquad s\{b_1\} = .006840$$


Comments

1. The Hildreth-Lu procedure, unlike the Cochrane-Orcutt procedure, does not require any
iterations once the estimate of the autocorrelation parameter $\rho$ is obtained.

2. Note from Table 12.5 that SSE as a function of $\rho$ is quite stable in a wide region around the
minimum, as is often the case. It indicates that the numerical search for finding the best value of $\rho$
need not be too fine unless there is particular interest in the intercept term $\beta_0$, since the estimate $b_0$ is
sensitive to the value of $r$.
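The numerical search described for the Hildreth-Lu procedure can be sketched as a plain grid search over candidate values of $\rho$, refitting the transformed model (12.17) for each and keeping the value with the smallest SSE in (12.26). The function names, the grid, and the simulation helper are illustrative assumptions; dedicated software would typically use a finer or adaptive search.

```python
import random


def ols(x, y):
    """Least squares fit of a simple linear regression; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def transformed_sse(x, y, rho):
    """Fit model (12.17) for a given rho; return (SSE, b0', b1') as in (12.26)."""
    xt = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    yt = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    b0_t, b1_t = ols(xt, yt)
    sse = sum((yi - b0_t - b1_t * xi) ** 2 for xi, yi in zip(xt, yt))
    return sse, b0_t, b1_t


def hildreth_lu(x, y, grid):
    """Pick the grid value of rho minimizing SSE and back-transform via (12.20)."""
    best_rho = min(grid, key=lambda rho: transformed_sse(x, y, rho)[0])
    sse, b0_t, b1_t = transformed_sse(x, y, best_rho)
    return {"rho": best_rho, "sse": sse, "b0": b0_t / (1.0 - best_rho), "b1": b1_t}


def simulate_ar1_regression(n, beta0, beta1, rho, sigma, seed=0):
    """Hypothetical helper: X_t = t and Y_t from model (12.1), for trying the sketch."""
    rng = random.Random(seed)
    x = [float(t) for t in range(1, n + 1)]
    y, eps_prev = [], 0.0
    for x_t in x:
        eps_prev = rho * eps_prev + rng.gauss(0.0, sigma)
        y.append(beta0 + beta1 * x_t + eps_prev)
    return x, y
```

The grid must exclude 1.0, since the back-transformation divides by $1 - \rho$. A coarse grid followed by a finer one around the minimum mirrors the two-stage search described in the text.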

First Differences Procedure 

Since the autocorrelation parameter $\rho$ is frequently large and SSE as a function of $\rho$ often
is quite flat for large values of $\rho$ up to 1.0, as in the Blaisdell Company example, some
economists and statisticians have suggested use of $\rho = 1.0$ in the transformed model (12.17).
If $\rho = 1$, $\beta_0' = \beta_0(1 - \rho) = 0$, and the transformed model (12.17) becomes:

$$Y_t' = \beta_1' X_t' + u_t \tag{12.29}$$

where:

$$Y_t' = Y_t - Y_{t-1} \tag{12.29a}$$
$$X_t' = X_t - X_{t-1} \tag{12.29b}$$

Thus, again, the regression coefficient $\beta_1' = \beta_1$ can be directly estimated by ordinary least
squares methods, this time based on regression through the origin. Note that the transformed
variables in (12.29a) and (12.29b) are ordinary first differences. It has been found that this
first differences approach is effective in a variety of applications in reducing the
autocorrelations of the error terms, and of course it is much simpler than the Cochrane-Orcutt and
Hildreth-Lu procedures.
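A minimal sketch of the first differences fit follows. It estimates the slope by regression through the origin on the differenced data and then recovers an intercept for the original variables using (12.31a), which appears later in this section. The function name is an assumption.

```python
def first_differences_fit(x, y):
    """Fit model (12.29) by regression through the origin on first differences.

    Returns (b0, b1), where b1 is the through-origin slope for the differenced
    data and b0 = Y_bar - b1 * X_bar is the intercept for the original variables.
    """
    xd = [x[t] - x[t - 1] for t in range(1, len(x))]  # (12.29b)
    yd = [y[t] - y[t - 1] for t in range(1, len(y))]  # (12.29a)
    b1 = sum(xi * yi for xi, yi in zip(xd, yd)) / sum(xi ** 2 for xi in xd)
    n = len(x)
    b0 = sum(y) / n - b1 * sum(x) / n
    return b0, b1
```

For data lying exactly on the line $Y = 3 + 2X$, the differenced series satisfies $Y' = 2X'$, so the sketch returns a slope of 2 and an intercept of 3.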

The fitted regression function in the transformed variables:

$$\hat{Y}' = b_1' X' \tag{12.30}$$

can be transformed back to the original variables as follows:

$$\hat{Y} = b_0 + b_1 X \tag{12.31}$$

where:

$$b_0 = \bar{Y} - b_1'\bar{X} \tag{12.31a}$$
$$b_1 = b_1' \tag{12.31b}$$

Table 12.6 illustrates the transformed variables $Y_t'$ and $X_t'$ based on the first differences
transformations in (12.29a, b) for the Blaisdell Company example. Application of ordinary
least squares for estimating a linear regression through the origin leads to the results shown
at the bottom of Table 12.6. The fitted regression function in the transformed variables is:

$$\hat{Y}' = .16849X' \tag{12.32}$$

where:

$$Y_t' = Y_t - Y_{t-1}$$
$$X_t' = X_t - X_{t-1}$$

TABLE 12.6 First Differences and Regression Results with First Differences Procedure, Blaisdell Company Example.

           (1)       (2)            (3)                  (4)
   t       Y_t       X_t     Y'_t = Y_t - Y_{t-1}  X'_t = X_t - X_{t-1}

   1      20.96     127.3           --                   --
   2      21.40     130.0           .44                  2.7
   3      21.96     132.7           .56                  2.7
   4      21.52     129.4          -.44                 -3.3
  ...      ...       ...            ...                  ...
  17      27.52     164.2           .54                  3.5
  18      27.78     165.6           .26                  1.4
  19      28.24     168.7           .46                  3.1
  20      28.78     171.7           .54                  3.0

$\hat{Y}' = .16849X'$
$s\{b_1'\} = .005096$    $MSE = .00482$

To examine whether the first differences procedure has removed the autocorrelations,
we shall use the Durbin-Watson test. There are two points to note when using the
Durbin-Watson test with the first differences procedure. Sometimes the first differences procedure
can overcorrect, leading to negative autocorrelations in the error terms. Hence, it may be
appropriate to use a two-sided Durbin-Watson test when testing for autocorrelation with
first differences data. The second point is that the first differences model (12.29) has no
intercept term, but the Durbin-Watson test requires a fitted regression with an intercept
term. A valid test for autocorrelation in a no-intercept model can be carried out by fitting for
this purpose a regression function with an intercept term. Of course, the fitted no-intercept
model is still the model of basic interest.

In the Blaisdell Company example, the Durbin-Watson statistic for the fitted first
differences regression model with an intercept term is $D = 1.75$. This indicates uncorrelated
error terms for either a one-sided test (with $\alpha = .01$) or a two-sided test (with $\alpha = .02$).

With the first differences procedure successfully eliminating the autocorrelation, we
return to a fitted model in the original variables by using (12.31):

$$\hat{Y} = -.30349 + .16849X \tag{12.33}$$

where:

$$b_0 = 24.569 - .16849(147.62) = -.30349$$

We know from Table 12.6 that the estimated standard deviation of $b_1$ is $s\{b_1\} = .005096$
since $b_1 = b_1'$.

TABLE 12.7 Major Regression Results for Three Transformation Procedures, Blaisdell Company Example.

                                                        Estimate of sigma^2
 Procedure                   b_1      s{b_1}      r          (MSE)

 Cochrane-Orcutt            .1738     .0030      .63         .0045
 Hildreth-Lu                .1605     .0068      .96         .0042
 First differences          .1685     .0051     1.0          .0048
 Ordinary least squares     .1763     .0014      --           --

Comparison of Three Methods 

Table 12.7 contains some of the main regression results for the three transformation methods 
and also for the ordinai'y least squares regression fit to the original variables. A number of 
key points stand out: 

1. All of the estimates of /0| are quite close to each other. 

2. The estimated standard deviations of based on Hildi'eth-Lu and fii-st differences trans¬ 
formation methods are quite close to each other; that with the Cochrane-Orcutt (roce- 
dure is somewhat smaller. The estimated standard deviation of b x based on ordinary 
least squares regression with the original variables is still smaller. This is as expected, 
since we noted earlier thai the estimated standard deviations s{b/；} calculated according 
to ordinary least squares may seriously underestimate the true standard deviations a{b k } 
when positive autocorrelation is present. 

3. All three ti.ansfoirnation methods provide essentially the same estimate of cr 2 , the vari¬ 
ance of the disturbance terms it 卜 

The three transformation methods do not always work equally well, as happens to be the 
case here for the Blaisdell Company example. The Cochrane-Orcutt procedure may fail to 
i-emove autocorrelation in one or two iterations, in which case the Hildretli-Lu or the first 
differences procedures may be preferable. When several of the transformation methods are 
effective in removing autocorrelation，then simplicity of calculations may be considered in 
choosing fmm among these pi*ocedures. 

Comment 

Further discussions of the Cochrane-Orcutt, Hildreth-Lu, and first differences procedures, as well as 
of other remedial procedures for autocorrelated errors, may be found in specialized texts, such as 
Reference 12.4.




Forecasting with Autocorrelated Error Terms

One important use of autoregressive error regression models is to make forecasts. With these
models, information about the error term in the most recent period $n$ can be incorporated
into the forecast for period $n + 1$. This provides a more accurate forecast because, when
autoregressive error regression models are appropriate, the error terms in successive periods
are correlated. Thus, if sales in period $n$ are above their expected value and successive error
terms are positively correlated, it follows that sales in period $n + 1$ will likely be above their
expected value also.

We shall explain the basic ideas underlying the development of forecasts that exploit the
presence of autocorrelated error terms by again employing the simple linear autoregressive
error term regression model (12.1). The extension to multiple regression model (12.2) is
direct. First, we consider forecasting when either the Cochrane-Orcutt or the Hildreth-Lu
procedure has been utilized for estimating the regression parameters.

When we express regression model (12.1):

$$Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t$$

by using the structure of the error terms:

$$\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$$

we obtain:

$$Y_t = \beta_0 + \beta_1 X_t + \rho\,\varepsilon_{t-1} + u_t$$

For period $n + 1$, we obtain:

$$Y_{n+1} = \beta_0 + \beta_1 X_{n+1} + \rho\,\varepsilon_n + u_{n+1} \tag{12.34}$$

Thus, $Y_{n+1}$ is made up of three components:

1. The expected value $\beta_0 + \beta_1 X_{n+1}$.

2. A multiple $\rho$ of the preceding error term $\varepsilon_n$.

3. An independent, random disturbance term with $E\{u_{n+1}\} = 0$.

The forecast for next period $n + 1$, to be denoted by $F_{n+1}$, is constructed by dealing with
each of the three components in (12.34):

1. Given $X_{n+1}$, we estimate the expected value $\beta_0 + \beta_1 X_{n+1}$ as usual from the fitted
regression function:

$$\hat{Y}_{n+1} = b_0 + b_1 X_{n+1}$$

where $b_0$ and $b_1$ are the estimated regression coefficients for the original variables
obtained from $b_0'$ and $b_1'$ for the transformed variables according to (12.20).

2. $\rho$ is estimated by $r$ in (12.22), and $\varepsilon_n$ is estimated by the residual $e_n$:

$$e_n = Y_n - (b_0 + b_1 X_n) = Y_n - \hat{Y}_n$$

Thus, $\rho\,\varepsilon_n$ is estimated by $r e_n$.

3. The disturbance term $u_{n+1}$ has expected value zero and is independent of earlier
information. Hence, we use its expected value of zero in the forecast.

Thus, the forecast for period $n + 1$ is:

$$F_{n+1} = \hat{Y}_{n+1} + r e_n \tag{12.35}$$
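The three steps leading to (12.35) can be sketched in a few lines of Python. This is only an illustration: it assumes the back-transformed coefficients $b_0$ and $b_1$ and the estimate $r$ are already available, and the function name, argument names, and input values are hypothetical rather than taken from the text.

```python
def forecast_next(b0, b1, r, x_next, x_last, y_last):
    """One-period-ahead forecast F_{n+1} = Yhat_{n+1} + r * e_n, as in (12.35)."""
    y_hat_next = b0 + b1 * x_next          # step 1: fitted value at X_{n+1}
    e_last = y_last - (b0 + b1 * x_last)   # step 2: residual e_n = Y_n - Yhat_n
    return y_hat_next + r * e_last         # step 3: disturbance u_{n+1} replaced by 0


# Hypothetical inputs, chosen only to exercise the function.
f_next = forecast_next(b0=2.0, b1=0.5, r=0.6, x_next=12.0, x_last=10.0, y_last=7.5)
print(round(f_next, 3))
```

Setting `r=1.0` gives the first differences version of the forecast, and `r=0.0` reduces it to the ordinary fitted value.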



An approximate $1 - \alpha$ prediction interval for $Y_{n+1(\text{new})}$, the new observation on the
response variable, may be obtained by employing the usual prediction limits for a new
observation in (2.36), but based on the transformed observations. Thus, $Y_i$ and $X_i$ in formula (2.38a)
for the estimated variance $s^2\{\text{pred}\}$ are replaced by $Y_t'$ and $X_t'$ as defined in (12.18).

The approximate $1 - \alpha$ prediction limits for $Y_{n+1(\text{new})}$ with simple linear regression
therefore are:

$$F_{n+1} \pm t(1 - \alpha/2;\, n - 3)\, s\{\text{pred}\} \tag{12.36}$$

where $s\{\text{pred}\}$, defined in (2.38a), is here based on the transformed observations. Note the
use of $n - 3$ degrees of freedom for the $t$ multiple, since there are only $n - 1$ transformed
cases and two degrees of freedom are lost for estimating the two parameters in the simple
linear regression function.

When forecasts are based on the first differences procedure, the forecast in (12.35) is
still applicable, but $r = 1$ now. The estimated standard deviation $s\{\text{pred}\}$ now is calculated
according to formula (4.20) in Table 4.1 for one predictor variable, using the transformed
variables. Finally, the degrees of freedom for the $t$ multiple in (12.36) will be $n - 2$, since
only one parameter has to be estimated in the no-intercept regression model (12.29).

Example

For the Blaisdell Company example, the trade association has projected that deseasonalized
industry sales in the first quarter of 2003 (i.e., quarter 21) will be $X_{21}$ = \$175.3 million.
To forecast Blaisdell Company sales for quarter 21, we shall use the Cochrane-Orcutt fitted
regression function (12.24):

$$\hat{Y} = -1.0685 + .17376X$$

First, we need to obtain the residual $e_{20}$:

$$e_{20} = Y_{20} - \hat{Y}_{20} = 28.78 - [-1.0685 + .17376(171.7)] = .0139$$

The fitted value when $X_{21} = 175.3$ is:

$$\hat{Y}_{21} = -1.0685 + .17376(175.3) = 29.392$$

The forecast for period 21 then is:

$$F_{21} = \hat{Y}_{21} + r e_{20} = 29.392 + .631166(.0139) = 29.40$$

Note how the fact that company sales in quarter 20 were slightly above their estimated mean
has a small positive influence on the forecast for company sales for quarter 21.

We wish to set up a 95 percent prediction interval for $Y_{21(\text{new})}$. Using the data for the
transformed variables in Table 12.4, we calculate $s\{\text{pred}\}$ by (2.38) for:

$$X_{n+1}' = X_{n+1} - .631166X_n = 175.3 - .631166(171.7) = 66.929$$

We obtain $s\{\text{pred}\} = .0757$ (calculations not shown). We require $t(.975; 17) = 2.110$. We
therefore obtain the prediction limits $29.40 \pm 2.110(.0757)$ and the prediction interval:

$$29.24 \le Y_{21(\text{new})} \le 29.56$$

Given quarter 20 seasonally adjusted company sales of \$28.78 million and other past sales,
and given quarter 21 industry sales of \$175.3 million, we predict with approximately 95 percent
confidence that seasonally adjusted Blaisdell Company sales in quarter 21 will be
between \$29.24 and \$29.56 million.
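The arithmetic of this example can be checked with a short script. The sketch below uses only numbers quoted in the text; $s\{\text{pred}\}$ and the $t$ multiple are taken as given rather than recomputed from Table 12.4, and the variable names are illustrative.

```python
# Values quoted in the Cochrane-Orcutt example for the Blaisdell Company.
b0, b1, r = -1.0685, 0.17376, 0.631166
x20, y20, x21 = 171.7, 28.78, 175.3
s_pred, t_mult = 0.0757, 2.110   # s{pred} and t(.975; 17), both taken from the text

e20 = y20 - (b0 + b1 * x20)      # residual for quarter 20
y_hat21 = b0 + b1 * x21          # fitted value for quarter 21
f21 = y_hat21 + r * e20          # forecast (12.35)
x21_prime = x21 - r * x20        # transformed X value at which s{pred} is evaluated

lower = f21 - t_mult * s_pred    # approximate prediction limits (12.36)
upper = f21 + t_mult * s_pred

print(f"e20 = {e20:.4f}, F21 = {f21:.2f}, X'21 = {x21_prime:.3f}")
print(f"interval: {lower:.2f} to {upper:.2f}")
```

The correction term `r * e20` is under .01 here, which is why the forecast differs so little from the fitted value.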





To obtain a forecast of actual sales including seasonal effects in quarter 21, the Blaisdell 
Company still needs to incorporate the first quarter seasonal effect into the forecast of 
seasonally adjusted sales. 

The forecasts with the other transformation procedures are very similar to the one with
the Cochrane-Orcutt procedure. With the first differences estimated regression function
(12.33), the forecast for quarter 21 is:

$$F_{21} = [-.30349 + .16849(175.3)] + 1.0[28.78 + .30349 - .16849(171.70)] = 29.39$$

The estimated standard deviation $s\{\text{pred}\}$ calculated according to (4.20) with the
transformed data in Table 12.6 is $s\{\text{pred}\} = .0718$ (calculations not shown). For a 95 percent
prediction interval, we require $t(.975; 18) = 2.101$. The prediction limits therefore are
$29.39 \pm 2.101(.0718)$ and the approximate 95 percent prediction interval is:

$$29.24 \le Y_{21(\text{new})} \le 29.54$$

This forecast is practically the same as that with the Cochrane-Orcutt estimates.
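With $r = 1$ the intercept cancels from (12.35), so the first differences forecast reduces to $F_{n+1} = Y_n + b_1(X_{n+1} - X_n)$. The following sketch checks this against the figures in the text; the variable names are illustrative, and $s\{\text{pred}\}$ and the $t$ multiple are taken as given.

```python
b1 = 0.16849                      # first differences slope estimate from the text
x20, y20, x21 = 171.7, 28.78, 175.3
s_pred, t_mult = 0.0718, 2.101    # s{pred} and t(.975; 18), both taken from the text

# (12.35) with r = 1: b0 + b1*X21 + (Y20 - b0 - b1*X20) = Y20 + b1*(X21 - X20)
f21 = y20 + b1 * (x21 - x20)

lower = f21 - t_mult * s_pred
upper = f21 + t_mult * s_pred
print(f"F21 = {f21:.2f}, interval: {lower:.2f} to {upper:.2f}")
```

In this form the forecast is simply last period's observation plus the estimated slope times the change in the predictor.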

The approximate 95 percent prediction interval with the estimated regression
function (12.28) based on the Hildreth-Lu procedure is (calculations not shown):

$$29.24 \le Y_{21(\text{new})} \le 29.52$$

This forecast is practically the same as the other two.

Comments 

1. Forecasts obtained with autoregressive error regression models (12.1) and (12.2) are conditional
on the past observations $Y_n$, $Y_{n-1}$, etc. They are also conditional on $X_{n+1}$, which often has to be
projected as in the Blaisdell Company example.

2. Forecasts for two or more periods ahead can also be developed, using the recursive relations of
$\varepsilon_t$ to earlier error terms developed in Section 12.2. For example, given $X_{n+2}$, the forecast for period
$n + 2$, based on either Cochrane-Orcutt or Hildreth-Lu estimates, is:

$$F_{n+2} = \hat{Y}_{n+2} + r^2 e_n \tag{12.37}$$

For the first differences estimates, the forecast in (12.37) is calculated with $r = 1$.
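As a brief sketch of why $r^2$ appears in (12.37), assuming the first-order autoregressive error structure of model (12.1), apply the error recursion twice:

$$\varepsilon_{n+2} = \rho\,\varepsilon_{n+1} + u_{n+2} = \rho^2 \varepsilon_n + \rho\,u_{n+1} + u_{n+2}$$

The disturbances $u_{n+1}$ and $u_{n+2}$ have expected value zero and are independent of earlier information, so the only part of $\varepsilon_{n+2}$ that can be anticipated in period $n$ is $\rho^2 \varepsilon_n$, which is estimated by $r^2 e_n$.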

3. The approximate prediction limits (12.36) assume that the value of $r$ used in the
transformations (12.18) is the true value of $\rho$; that is, $r = \rho$. If that is the case, the standard regression
assumptions apply since we are then dealing with the transformed model (12.17). To see that the
prediction limits obtained from the transformed model are applicable to the forecast $F_{n+1}$ in (12.35),
recall that $\sigma^2\{\text{pred}\}$ in (2.37) is the variance of the difference $Y_{h(\text{new})} - \hat{Y}_h$. In terms of the situation
here for the transformed variables, we have the following correspondences:

$Y_{h(\text{new})}$ corresponds to $Y_{n+1}' = Y_{n+1} - rY_n$

$\hat{Y}_h$ corresponds to $\hat{Y}_{n+1}' = b_0' + b_1' X_{n+1}' = b_0(1 - r) + b_1(X_{n+1} - rX_n)$

The difference $Y_{n+1}' - \hat{Y}_{n+1}'$ is:

$$\begin{aligned}
Y_{n+1}' - \hat{Y}_{n+1}' &= (Y_{n+1} - rY_n) - b_0(1 - r) - b_1(X_{n+1} - rX_n) \\
&= Y_{n+1} - (b_0 + b_1 X_{n+1}) - r(Y_n - b_0 - b_1 X_n) \\
&= Y_{n+1} - \hat{Y}_{n+1} - r e_n \\
&= Y_{n+1} - F_{n+1}
\end{aligned}$$

Hence, $Y_{n+1}$ plays the role of $Y_{h(\text{new})}$ and $F_{n+1}$ plays the role of $\hat{Y}_h$ in (2.37). The prediction limits (12.36)
are approximate because $r$ is only an estimate of $\rho$.



Cited References

12.1. Box, G. E. P., and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Rev. ed. San
Francisco: Holden-Day, 1976.

12.2. Durbin, J., and G. S. Watson. "Testing for Serial Correlation in Least Squares Regression. II,"
Biometrika 38 (1951), pp. 159-78.

12.3. Theil, H., and A. L. Nagar. "Testing the Independence of Regression Disturbances," Journal of
the American Statistical Association 56 (1961), pp. 793-806.

12.4. Greene, W. H. Econometric Analysis. 5th ed. Upper Saddle River, New Jersey: Prentice Hall,
2003.

Problems


12.1. Refer to Table 12.1.

a. Plot $\varepsilon_t$ against $\varepsilon_{t-1}$ for $t = 1, \ldots, 10$ on a graph. How is the positive first-order
autocorrelation in the error terms shown by the plot?

b. If you plotted $u_t$ against $\varepsilon_{t-1}$ for $t = 1, \ldots, 10$, what pattern would you expect?

12.2. Refer to Plastic hardness Problem 1.22. If the same test item were measured at 12 different
points in time, would the error terms in the regression model likely be autocorrelated? Discuss.

12.3. A student stated that the first-order autoregressive error models (12.1) and (12.2) are too simple
for business time series data because the error term in period $t$ in such data is also influenced
by random effects that occurred more than one period in the past. Comment.


12.4. A student writing a term paper used ordinary least squares in fitting a simple linear regression
model to some time series data containing positively autocorrelated errors, and found that the
90 percent confidence interval for $\beta_1$ was too wide to be useful. The student then decided to
employ regression model (12.1) to improve the precision of the estimate. Comment.

12.5. For each of the following tests concerning the autocorrelation parameter $\rho$ in regression
model (12.2) with three predictor variables, state the appropriate decision rule based on the
Durbin-Watson test statistic for a sample of size 38: (1) $H_0$: $\rho = 0$, $H_a$: $\rho \ne 0$, $\alpha$ = [value illegible];
(2) $H_0$: $\rho = 0$, $H_a$: $\rho < 0$, $\alpha = .05$; (3) $H_0$: $\rho = 0$, $H_a$: $\rho > 0$, $\alpha = .01$.

*12.6. Refer to Copier maintenance Problem 1.20. The observations are listed in time order. Assume
that regression model (12.1) is appropriate. Test whether or not positive autocorrelation is
present; use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

12.7. Refer to Grocery retailer Problem 6.9. The observations are listed in time order. Assume that
regression model (12.2) is appropriate. Test whether or not positive autocorrelation is present;
use $\alpha = .05$. State the alternatives, decision rule, and conclusion.

12.8. Refer to Crop yield Problem 11.25. The observations are listed in time order. Assume that
regression model (12.2) with first- and second-order terms for the two predictor variables and
no interaction term is appropriate. Test whether or not positive autocorrelation is present; use
$\alpha = .01$. State the alternatives, decision rule, and conclusion.


*12.9. Microcomputer components. A staff analyst for a manufacturer of microcomputer
components has compiled monthly data for the past 16 months on the value of industry production of
processing units that use these components ($X$, in million dollars) and the value of the firm's
components used ($Y$, in thousand dollars). The analyst believes that a simple linear regression
relation is appropriate but anticipates positive autocorrelation. The data follow:

t:      1      2      3     ...    14     15     16
X_t:  2.052  2.026  2.002   ...  2.080  2.102  2.150
Y_t:  102.9  101.5  100.8   ...  104.8  105.0  107.2

a. Fit a simple linear regression model by ordinary least squares and obtain the residuals.
Also obtain $s\{b_0\}$ and $s\{b_1\}$.

b. Plot the residuals against time and explain whether you find any evidence of positive
autocorrelation.

c. Conduct a formal test for positive autocorrelation using $\alpha = .05$. State the alternatives,
decision rule, and conclusion. Is the residual analysis in part (b) in accord with the test result?

*12.10. Refer to Microcomputer components Problem 12.9. The analyst has decided to employ
regression model (12.1) and use the Cochrane-Orcutt procedure to fit the model.

a. Obtain a point estimate of the autocorrelation parameter. How well does the approximate
relationship (12.25) hold here between this point estimate and the Durbin-Watson test
statistic?

b. Use one iteration to obtain the estimates $b_0'$ and $b_1'$ of the regression coefficients $\beta_0'$ and
$\beta_1'$ in transformed model (12.17) and state the estimated regression function. Also obtain
$s\{b_0'\}$ and $s\{b_1'\}$.

c. Test whether any positive autocorrelation remains after the first iteration using $\alpha = .05$.
State the alternatives, decision rule, and conclusion.

d. Restate the estimated regression function obtained in part (b) in terms of the original
variables. Also obtain $s\{b_0\}$ and $s\{b_1\}$. Compare the estimated regression coefficients obtained
with the Cochrane-Orcutt procedure and their estimated standard deviations with those
obtained with ordinary least squares in Problem 12.9a.

e. On the basis of the results in parts (c) and (d), does the Cochrane-Orcutt procedure appear
to have been effective here?

f. The value of industry production in month 17 will be \$2.210 million. Predict the value of
the firm's components used in month 17; employ a 95 percent prediction interval. Interpret
your interval.

g. Estimate $\beta_1$ with a 95 percent confidence interval. Interpret your interval.

*12.11. Refer to Microcomputer components Problem 12.9. Assume that regression model (12.1)
is applicable.

a. Use the Hildreth-Lu procedure to obtain a point estimate of the autocorrelation parameter.
Do a search at the values $\rho = .1, .2, \ldots, 1.0$ and select from these the value of $\rho$ that
minimizes SSE.

b. From your estimate in part (a), obtain an estimate of the transformed regression
function (12.17). Also obtain $s\{b_0'\}$ and $s\{b_1'\}$.

c. Test whether any positive autocorrelation remains in the transformed regression model;
use $\alpha = .05$. State the alternatives, decision rule, and conclusion.

d. Restate the estimated regression function obtained in part (b) in terms of the original
variables. Also obtain $s\{b_0\}$ and $s\{b_1\}$. Compare the estimated regression coefficients
obtained with the Hildreth-Lu procedure and their estimated standard deviations with
those obtained with ordinary least squares in Problem 12.9a.

e. Based on the results in parts (c) and (d), has the Hildreth-Lu procedure been effective here?

f. The value of industry production in month 17 will be \$2.210 million. Predict the value of
the firm's components used in month 17; employ a 95 percent prediction interval. Interpret
your interval.

g. Estimate $\beta_1$ with a 95 percent confidence interval. Interpret your interval.

*12.12. Refer to Microcomputer components Problem 12.9. Assume that regression model (12.1)
is applicable and that the first differences procedure is to be employed.

a. Estimate the regression coefficient $\beta_1'$ in the transformed regression model (12.29) and
obtain the estimated standard deviation of this estimate. State the estimated regression
function.

b. Test whether or not the error terms with the first differences procedure are autocorrelated,
using a two-sided test and $\alpha = .10$. State the alternatives, decision rule, and conclusion.
Why is a two-sided test meaningful here?

c. Restate the estimated regression function obtained in part (a) in terms of the original
variables. Also obtain $s\{b_1\}$. Compare the estimated regression coefficients obtained with
the first differences procedure and the estimated standard deviation $s\{b_1\}$ with the results
obtained with ordinary least squares in Problem 12.9a.

d. On the basis of the results in parts (b) and (c), has the first differences procedure been
effective here?

e. The value of industry production in month 17 will be \$2.210 million. Predict the value of
the firm's components used in month 17; employ a 95 percent prediction interval. Interpret
your interval.

f. Estimate $\beta_1$ with a 95 percent confidence interval. Interpret your interval.


12.13. Advertising agency. The managing partner of an advertising agency is interested in the
possibility of making accurate predictions of monthly billings. Monthly data on amount of
billings ($Y$, in thousands of constant dollars) and on number of hours of staff time ($X$, in
thousand hours) for the 20 most recent months follow. A simple linear regression model is
believed to be appropriate, but positively autocorrelated error terms may be present.

t:      1      2      3     ...    18     19     20
X_t:  2.521  2.171  2.234   ...  3.117  3.623  3.618
Y_t:  220.4  203.9  207.2   ...  252.4  278.6  278.5

a. Fit a simple linear regression model by ordinary least squares and obtain the residuals. Also
obtain $s\{b_0\}$ and $s\{b_1\}$.

b. Plot the residuals against time and explain whether you find any evidence of positive
autocorrelation.

c. Conduct a formal test for positive autocorrelation using $\alpha = .01$. State the alternatives,
decision rule, and conclusion. Is the residual analysis in part (b) in accord with the test
result?


12.14. Refer to Advertising agency Problem 12.13. Assume that regression model (12.1) is
applicable and that the Cochrane-Orcutt procedure is to be employed.

a. Obtain a point estimate of the autocorrelation parameter. How well does the approximate
relationship (12.25) hold here between the point estimate and the Durbin-Watson test
statistic?

b. Use one iteration to obtain the estimates $b_0'$ and $b_1'$ of the regression coefficients $\beta_0'$ and
$\beta_1'$ in transformed model (12.17) and state the estimated regression function. Also obtain
$s\{b_0'\}$ and $s\{b_1'\}$.

c. Test whether any positive autocorrelation remains after the first iteration using $\alpha = .01$.
State the alternatives, decision rule, and conclusion.

d. Restate the estimated regression function obtained in part (b) in terms of the original
variables. Also obtain $s\{b_0\}$ and $s\{b_1\}$. Compare the estimated regression coefficients obtained
with the Cochrane-Orcutt procedure and their estimated standard deviations with those
obtained with ordinary least squares in Problem 12.13a.

e. Based on the results in parts (c) and (d), does the Cochrane-Orcutt procedure appear to
have been effective here?

f. Staff time in month 21 is expected to be 3.625 thousand hours. Predict the amount of
billings in constant dollars for month 21, using a 99 percent prediction interval. Interpret
your interval.

g. Estimate $\beta_1$ with a 99 percent confidence interval. Interpret your interval.

12.15. Refer to Advertising agency Problem 12.13. Assume that regression model (12.1) is
applicable.

a. Use the Hildreth-Lu procedure to obtain a point estimate of the autocorrelation parameter.
Do a search at the values $\rho = .1, .2, \ldots, 1.0$ and select from these the value of $\rho$ that
minimizes SSE.

b. Based on your estimate in part (a), obtain an estimate of the transformed regression
function (12.17). Also obtain $s\{b_0'\}$ and $s\{b_1'\}$.

c. Test whether any positive autocorrelation remains in the transformed regression model;
use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

d. Restate the estimated regression function obtained in part (b) in terms of the original
variables. Also obtain $s\{b_0\}$ and $s\{b_1\}$. Compare the estimated regression coefficients
obtained with the Hildreth-Lu procedure and their estimated standard deviations with
those obtained with ordinary least squares in Problem 12.13a.

e. Based on the results in parts (c) and (d), has the Hildreth-Lu procedure been effective here?

f. Staff time in month 21 is expected to be 3.625 thousand hours. Predict the amount of
billings in constant dollars for month 21, using a 99 percent prediction interval. Interpret
your interval.

g. Estimate $\beta_1$ with a 99 percent confidence interval. Interpret your interval.

12.16. Refer to Advertising agency Problem 12.13. Assume that regression model (12.1) is
applicable and that the first differences procedure is to be employed.

a. Estimate the regression coefficient $\beta_1'$ in the transformed regression model (12.29) and
obtain the estimated standard deviation of this estimate. State the estimated regression
function.

b. Test whether or not the error terms with the first differences procedure are autocorrelated,
using a two-sided test and $\alpha = .02$. State the alternatives, decision rule, and conclusion.
Why is a two-sided test meaningful here?

c. Restate the estimated regression function obtained in part (a) in terms of the original
variables. Also obtain $s\{b_1\}$. Compare the estimated regression coefficients obtained with
the first differences procedure and the estimated standard deviation $s\{b_1\}$ with the results
obtained with ordinary least squares in Problem 12.13a.

d. Based on the results in parts (b) and (c), has the first differences procedure been effective
here?

e. Staff time in month 21 is expected to be 3.625 thousand hours. Predict the amount of
billings in constant dollars for month 21, using a 99 percent prediction interval. Interpret
your interval.

f. Estimate $\beta_1$ with a 99 percent confidence interval. Interpret your interval.

12.17. McGill Company sales. The data below show seasonally adjusted quarterly sales for the
McGill Company ($Y$, in million dollars) and for the entire industry ($X$, in million dollars) for
the most recent 20 quarters.

t:      1      2      3     ...    18     19     20
X_t:  127.3  130.0  132.7   ...  165.6  168.7  172.0
Y_t:  20.96  21.40  21.96   ...  27.78  28.24  28.78

a. Would you expect the autocorrelation parameter $\rho$ to be positive, negative, or zero here?

b. Fit a simple linear regression model by ordinary least squares and obtain the residuals.
Also obtain $s\{b_0\}$ and $s\{b_1\}$.

c. Plot the residuals against time and explain whether you find any evidence of positive
autocorrelation.

d. Conduct a formal test for positive autocorrelation using $\alpha = .01$. State the alternatives,
decision rule, and conclusion. Is the residual analysis in part (c) in accord with the test
result?

12.18. Refer to McGill Company sales Problem 12.17. Assume that regression model (12.1) is
applicable and that the Cochrane-Orcutt procedure is to be employed.

a. Obtain a point estimate of the autocorrelation parameter. How well does the approximate
relationship (12.25) hold here between the point estimate and the Durbin-Watson test
statistic?

b. Use one iteration to obtain the estimates $b_0'$ and $b_1'$ of the regression coefficients $\beta_0'$ and
$\beta_1'$ in transformed model (12.17) and state the estimated regression function. Also obtain
$s\{b_0'\}$ and $s\{b_1'\}$.

c. Test whether any positive autocorrelation remains after the first iteration; use $\alpha = .01$.
State the alternatives, decision rule, and conclusion.

d. Restate the estimated regression function obtained in part (b) in terms of the original
variables. Also obtain $s\{b_0\}$ and $s\{b_1\}$. Compare the estimated regression coefficients
obtained with the Cochrane-Orcutt procedure and their estimated standard deviations with
those obtained with ordinary least squares in Problem 12.17b.

e. On the basis of the results in parts (c) and (d), does the Cochrane-Orcutt procedure appear
to have been effective here?

f. Industry sales for quarter 21 are expected to be \$181.0 million. Predict the McGill Company
sales for quarter 21, using a 90 percent prediction interval. Interpret your interval.

g. Estimate $\beta_1$ with a 90 percent confidence interval. Interpret your interval.

12.19. Refer to McGill Company sales Problem 12.17. Assume that regression model (12.1) is
applicable.

a. Use the Hildreth-Lu procedure to obtain a point estimate of the autocorrelation parameter.
Do a search at the values $\rho = .1, .2, \ldots, 1.0$ and select from these the value of $\rho$ that
minimizes SSE.

b. Based on your estimate in part (a), obtain an estimate of the transformed regression
function (12.17). Also obtain $s\{b_0'\}$ and $s\{b_1'\}$.

c. Test whether any positive autocorrelation remains in the transformed regression model;
use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

d. Restate the estimated regression function obtained in part (b) in terms of the original
variables. Also obtain $s\{b_0\}$ and $s\{b_1\}$. Compare the estimated regression coefficients
obtained with the Hildreth-Lu procedure and their estimated standard deviations with
those obtained with ordinary least squares in Problem 12.17b.

e. Based on the results in parts (c) and (d), has the Hildreth-Lu procedure been effective
here?

f. Industry sales for quarter 21 are expected to be \$181.0 million. Predict the McGill Company
sales for quarter 21, using a 90 percent prediction interval. Interpret your interval.

g. Estimate $\beta_1$ with a 90 percent confidence interval. Interpret your interval.

12.20. Refer to McGill Company sales Problem 12.17. Assume that regression model (12.1) is
applicable and that the first differences procedure is to be employed.

a. Estimate the regression coefficient $\beta_1'$ in the transformed regression model (12.29) and
obtain the estimated standard deviation of this estimate. State the estimated regression
function.

b. Test whether or not the error terms with the first differences procedure are positively
autocorrelated using $\alpha = .01$. State the alternatives, decision rule, and conclusion.

c. Restate the estimated regression function obtained in part (a) in terms of the original
variables. Also obtain $s\{b_1\}$. Compare the estimated regression coefficients obtained with
the first differences procedure and the estimated standard deviation $s\{b_1\}$ with the results
obtained with ordinary least squares in Problem 12.17b.

d. On the basis of the results in parts (b) and (c), has the first differences procedure been
effective here?

e. Industry sales for quarter 21 are expected to be \$181.0 million. Predict the McGill Company
sales for quarter 21, using a 90 percent prediction interval. Interpret your interval.

f. Estimate $\beta_1$ with a 90 percent confidence interval. Interpret your interval.

Exercises

12.21. A student applying the first differences transformations in (12.29a, b) found that several
$X_t'$ values equaled zero but that the corresponding $Y_t'$ values were nonzero. Does this signify that
the first differences transformations are not appropriate for the data?


12.22. Derive (12.7) for $s = 2$.

12.23. Refer to first-order autoregressive error model (12.1). Suppose $Y_t$ is company's percent share
of the market, $X_t$ is company's selling price as a percent of average competitive selling price,
$\beta_0 = 100$, $\beta_1 = -.35$, $\rho = .6$, $\sigma^2 = 1$, and $\varepsilon_0 = 2.403$. Let $X_t$ and $u_t$ be as follows for
$t = 1, \ldots, 10$:

t:      1      2      3      4      5      6      7      8      9     10
X_t:   100    115    120     90     85     75     70     95    105    110
u_t:  .764   .509  -.242  -1.808  -.485   .501  -.539   .434  -.299   .030

a. Plot the true regression line. Generate the observations $Y_t$ ($t = 1, \ldots, 10$), and plot these
on the same graph. Fit a least squares regression line to the generated observations $Y_t$ and
plot it also on the same graph. How does your fitted regression line relate to the true line?

b. Repeat the steps in part (a) but this time let $\rho = 0$. In which of the two cases does the fitted
regression line come closer to the true line? Is this the expected outcome?

c. Generate the observations $Y_t$ for $\rho = -.7$. For each of the cases $\rho = .6$, $\rho = 0$, and
$\rho = -.7$, obtain the successive error term differences $\varepsilon_t - \varepsilon_{t-1}$ ($t = 1, \ldots, 10$).

d. For which of the three cases in part (c) is $\sum(\varepsilon_t - \varepsilon_{t-1})^2$ smallest? For which is it largest?
What generalization does this suggest?




12.24. For multiple regression model (12.2) with $p - 1 = 2$, derive the transformed model in which
the random terms are uncorrelated.

12.25. Suppose the autoregressive error process for the model $Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t$ is that given
by (12.11).

a. What would be the transformed variables $Y_t'$ and $X_t'$ for which the random terms in the
regression model are uncorrelated?

b. How would you estimate the parameters $\rho_1$ and $\rho_2$ for use with the Cochrane-Orcutt
procedure?

c. How would you estimate the parameters $\rho_1$ and $\rho_2$ with the Hildreth-Lu procedure?

12.26. Derive the forecast $F_{n+1}$ for a simple linear regression model with the second-order
autoregressive error process (12.11).


Projects

12.27. The true regression model is $Y_t = 10 + 24X_t + \varepsilon_t$, where $\varepsilon_t$ = [coefficient illegible] $\varepsilon_{t-1} + u_t$ and $u_t$ are
independent $N(0, 25)$.

a. Generate 11 independent random numbers from $N(0, 25)$. Use the first random number as $\varepsilon_0$,
obtain the 10 error terms $\varepsilon_1, \ldots, \varepsilon_{10}$, and then calculate the 10 observations $Y_1, \ldots, Y_{10}$
corresponding to $X_1 = 1, X_2 = 2, \ldots, X_{10} = 10$. Fit a linear regression function by
ordinary least squares and calculate MSE.

b. Repeat part (a) 100 times, using new random numbers each time.

c. Calculate the mean of the 100 estimates $b_1$. Does it appear that $b_1$ is an unbiased estimator
of $\beta_1$ despite the presence of positive autocorrelation?

d. Calculate the mean of the 100 estimates of MSE. Does it appear that MSE is a biased
estimator of $\sigma^2$? If so, does the magnitude of the bias appear to be small or large?
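A simulation of this kind can be organized as in the following sketch, which uses only the Python standard library. Several things here are assumptions rather than part of the problem statement: the autoregressive coefficient is passed in as the argument `rho` because its value is not legible in the problem as reproduced here, the first random draw is used as $\varepsilon_0$, and the seed, the function name `simulate_once`, and the value `rho=0.5` in the usage lines are purely illustrative.

```python
import random


def simulate_once(rho, rng, n=10, sigma=5.0):
    """Generate Y_1..Y_n from Y_t = 10 + 24 X_t + eps_t, fit OLS, return (b1, MSE)."""
    draws = [rng.gauss(0.0, sigma) for _ in range(n + 1)]  # N(0, 25) has sd 5
    eps_prev = draws[0]                     # assumption: first draw serves as eps_0
    xs = list(range(1, n + 1))
    ys = []
    for x_t, u_t in zip(xs, draws[1:]):
        eps_prev = rho * eps_prev + u_t     # eps_t = rho * eps_{t-1} + u_t
        ys.append(10.0 + 24.0 * x_t + eps_prev)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b1, sse / (n - 2)                # MSE has n - 2 degrees of freedom


rng = random.Random(0)                      # illustrative seed for reproducibility
results = [simulate_once(rho=0.5, rng=rng) for _ in range(100)]
mean_b1 = sum(b for b, _ in results) / len(results)
mean_mse = sum(m for _, m in results) / len(results)
print(round(mean_b1, 3), round(mean_mse, 3))
```

Comparing `mean_b1` with 24 and `mean_mse` with 25 addresses parts (c) and (d).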


Case Studies

12.28. Refer to the Website developer data set in Appendix C.6 and Case Study 9.29. The
observations are listed in time order. Using the model developed in Case Study 9.29, test whether or
not positive autocorrelation is present; use $\alpha = .01$. If autocorrelation is present, revise the
model and analysis as needed.

12.29. Refer to the Heating equipment data set in Appendix C.8. The observations are listed in
time order. Develop a reasonable predictor model for the monthly heating equipment orders.
Potential predictors include new homes for sale, current monthly deviation of temperature from
historical average temperature, the prime lending rate, current distributor inventory levels, the
amount of distributor sell through, and the level of discounting being offered. Your analysis
should determine whether or not autocorrelation is present using $\alpha = .05$. If autocorrelation
is present, revise the model and analysis as needed.





Part III

Nonlinear Regression

Chapter 13

Introduction to Nonlinear
Regression and Neural
Networks

The linear regression models considered up to this point are generally satisfactory
approximations for most regression applications. There are occasions, however, when an empirically
indicated or a theoretically justified nonlinear regression model is more appropriate. For
example, growth from birth to maturity in human subjects typically is nonlinear in nature,
characterized by rapid growth shortly after birth, pronounced growth during puberty, and
a leveling off sometime before adulthood. In another example, dose-response relationships
tend to be nonlinear with little or no change in response for low dose levels of a drug,
followed by rapid S-shaped changes occurring in the more active dose region, and finally with
dose response leveling off as it reaches a saturated level. We shall consider in this chapter
and the next some nonlinear regression models, how to obtain estimates of the regression
parameters in such models, and how to make inferences about these regression parameters.

In this chapter, we introduce exponential nonlinear regression models and present the
basic methods of nonlinear regression. We also introduce neural network models, which are
now widely used in data mining applications. In Chapter 14, we present logistic regression
models and consider their uses when the response variable is binary or categorical with
more than two levels.


13.1 Linear and Nonlinear Regression Models

Linear Regression Models 

In previous chapters, we considered linear regression models, i.e., models that are linear in 
the parameters. Such models can be represented by the general linear regression model (6.7): 

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i \tag{13.1}$$

Linear regression models, as we have seen, include not only first-order models in $p - 1$
predictor variables but also more complex models. For instance, a polynomial regression
model in one or more predictor variables is linear in the parameters, such as the following
model in two predictor variables with linear, quadratic, and interaction terms:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \beta_3 X_{i2} + \beta_4 X_{i2}^2 + \beta_5 X_{i1} X_{i2} + \varepsilon_i \tag{13.2}$$

Also, models with transformed variables that are linear in the parameters belong to the class 
of linear regression models, such as the following model: 

$$\log_{10} Y_i = \beta_0 + \beta_1 \sqrt{X_{i1}} + \beta_2 \exp(X_{i2}) + \varepsilon_i \tag{13.3}$$

In general, we can state a linear regression model in the form: 

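$$Y_i = f(\mathbf{X}_i, \boldsymbol{\beta}) + \varepsilon_i \tag{13.4}$$

where $\mathbf{X}_i$ is the vector of the observations on the predictor variables for the $i$th case:

$$\mathbf{X}_i = \begin{bmatrix} 1 \\ X_{i1} \\ \vdots \\ X_{i,p-1} \end{bmatrix} \tag{13.4a}$$

$\boldsymbol{\beta}$ is the vector of the regression coefficients in (6.18c), and $f(\mathbf{X}_i, \boldsymbol{\beta})$ represents the expected
value $E\{Y_i\}$, which for linear regression models equals according to (6.54):

$$f(\mathbf{X}_i, \boldsymbol{\beta}) = \mathbf{X}_i' \boldsymbol{\beta} \tag{13.4b}$$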


Nonlinear Regression Models 

Nonlinear regression models are of the same basic form as that in (13.4) for linear regression
models:

$$Y_i = f(\mathbf{X}_i, \boldsymbol{\gamma}) + \varepsilon_i \tag{13.5}$$

An observation $Y_i$ is still the sum of a mean response $f(\mathbf{X}_i, \boldsymbol{\gamma})$ given by the nonlinear
response function $f(\mathbf{X}, \boldsymbol{\gamma})$ and the error term $\varepsilon_i$. The error terms usually are assumed to
have expectation zero, constant variance, and to be uncorrelated, just as for linear regression
models. Often, a normal error model is utilized which assumes that the error terms are
independent normal random variables with constant variance.

The parameter vector in the response function $f(\mathbf{X}, \boldsymbol{\gamma})$ is now denoted by $\boldsymbol{\gamma}$ rather than
$\boldsymbol{\beta}$ as a reminder that the response function here is nonlinear in the parameters. We present
now two examples of nonlinear regression models that are widely used in practice.

Exponential Regression Models. One widely used nonlinear regression model is the
exponential regression model. When there is only a single predictor variable, one form of
this regression model with normal error terms is:

$$Y_i = \gamma_0 \exp(\gamma_1 X_i) + \varepsilon_i \tag{13.6}$$

where:

$\gamma_0$ and $\gamma_1$ are parameters
$X_i$ are known constants
$\varepsilon_i$ are independent $N(0, \sigma^2)$



The response function for this model is:

$$f(\mathbf{X}, \boldsymbol{\gamma}) = \gamma_0 \exp(\gamma_1 X) \tag{13.7}$$

Note that this model is not linear in the parameters $\gamma_0$ and $\gamma_1$.
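One informal way to see what "linear in the parameters" means is to check whether the mean response is additive in the parameter vector, that is, whether $f(\mathbf{X}, \mathbf{a} + \mathbf{b}) = f(\mathbf{X}, \mathbf{a}) + f(\mathbf{X}, \mathbf{b})$. The Python sketch below applies this check at one point to the polynomial model (13.2) and to the exponential response function (13.7). The function names and the parameter values are hypothetical choices made only for illustration, and passing the check at a single point is a necessary condition for linearity, not a proof of it.

```python
import math

def poly_mean(x, beta):
    """Mean response of the polynomial model (13.2); x = (X1, X2)."""
    x1, x2 = x
    terms = [1.0, x1, x1 ** 2, x2, x2 ** 2, x1 * x2]
    return sum(b * t for b, t in zip(beta, terms))

def exp_mean(x, gamma):
    """Mean response of the exponential model (13.7); x = (X,)."""
    g0, g1 = gamma
    return g0 * math.exp(g1 * x[0])

def is_additive_in_params(mean, x, p1, p2, tol=1e-9):
    """Check f(x, p1 + p2) == f(x, p1) + f(x, p2) at a single point x."""
    summed = [a + b for a, b in zip(p1, p2)]
    return abs(mean(x, summed) - (mean(x, p1) + mean(x, p2))) < tol

# Illustrative (made-up) parameter vectors.
poly_ok = is_additive_in_params(poly_mean, (1.5, 2.0),
                                [1, 2, 3, 4, 5, 6], [0.5, -1, 0.25, 2, -3, 1])
exp_ok = is_additive_in_params(exp_mean, (3.0,), [2.0, 0.1], [3.0, 0.2])
print(poly_ok, exp_ok)
```

The polynomial model passes the check even though it is curved in $X_1$ and $X_2$, while the exponential model fails it because $\gamma_1$ enters through the exponent.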


A more general nonlinear exponential regression model in one predictor variable with
normal error terms is:

$$Y_i = \gamma_0 + \gamma_1 \exp(\gamma_2 X_i) + \varepsilon_i \tag{13.8}$$

where the error terms are independent normal with constant variance $\sigma^2$. The response
function for this regression model is:

$$f(\mathbf{X}, \boldsymbol{\gamma}) = \gamma_0 + \gamma_1 \exp(\gamma_2 X) \tag{13.9}$$


Exponential regression model (13.8) is commonly used in growth studies where the rate
of growth at a given time $X$ is proportional to the amount of growth remaining as time
increases, with $\gamma_0$ representing the maximum growth value. Another use of this regression
model is to relate the concentration of a substance ($Y$) to elapsed time ($X$). Figure 13.1a
shows the response function (13.9) for parameter values $\gamma_0 = 100$, $\gamma_1 = -50$, and $\gamma_2 = -2$.
We shall discuss exponential regression models (13.6) and (13.8) in more detail later in this
chapter.
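The growth interpretation can be checked numerically. The short Python sketch below evaluates response function (13.9) at the parameter values quoted for Figure 13.1a and compares the slope with the amount of growth remaining. The function names are hypothetical, and the derivative $\gamma_1 \gamma_2 \exp(\gamma_2 X)$ is obtained by differentiating (13.9) with respect to $X$.

```python
import math

def exponential_response(x, g0=100.0, g1=-50.0, g2=-2.0):
    """Response function (13.9): E{Y} = g0 + g1 * exp(g2 * X)."""
    return g0 + g1 * math.exp(g2 * x)

def growth_rate(x, g0=100.0, g1=-50.0, g2=-2.0):
    """Derivative of (13.9) with respect to X: g1 * g2 * exp(g2 * X)."""
    return g1 * g2 * math.exp(g2 * x)

for x in (0.0, 0.5, 1.0, 2.0, 5.0):
    mean = exponential_response(x)
    remaining = 100.0 - mean  # growth still remaining below the ceiling g0
    # With g2 = -2 the rate equals -g2 * remaining = 2 * remaining.
    print(f"X = {x:3.1f}  E{{Y}} = {mean:8.4f}  "
          f"rate = {growth_rate(x):8.4f}  2*remaining = {2.0 * remaining:8.4f}")
```

The response starts at 50 when $X = 0$ and approaches the ceiling $\gamma_0 = 100$, with the rate of growth always equal to $-\gamma_2$ times the growth remaining.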


Logistic Regression Models. Another important nonlinear regression model is the logistic
regression model. This model with one predictor variable and normal error terms is:

$$Y_i = \frac{\gamma_0}{1 + \gamma_1 \exp(\gamma_2 X_i)} + \varepsilon_i \tag{13.10}$$

where the error terms $\varepsilon_i$ are independent normal with constant variance $\sigma^2$. The response
function here is:

$$f(\mathbf{X}, \boldsymbol{\gamma}) = \frac{\gamma_0}{1 + \gamma_1 \exp(\gamma_2 X)} \tag{13.11}$$

FIGURE 13.1 Plots of Exponential and Logistic Response Functions.
(a) Exponential Model (13.8): $E\{Y\} = 100 - 50 \exp(-2X)$
(b) Logistic Model (13.10): $E\{Y\} = 10/[1 + 20 \exp(-2X)]$


Note again that this response function is not linear in the parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$.

This logistic regression model has been used in population studies to relate, for instance,
number of species ($Y$) to time ($X$). Figure 13.1b shows the logistic response function (13.11)
for parameter values $\gamma_0 = 10$, $\gamma_1 = 20$, and $\gamma_2 = -2$. Note that the parameter $\gamma_0 = 10$
represents the maximum growth value here.
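A quick numerical look at the logistic response function (13.11) with the Figure 13.1b parameter values shows the S-shaped rise toward $\gamma_0$. The Python sketch below is illustrative only; the function name is hypothetical, and the halfway point $X = \log_e(\gamma_1)/(-\gamma_2)$ follows from setting $\gamma_1 \exp(\gamma_2 X) = 1$ in (13.11).

```python
import math

def logistic_response(x, g0=10.0, g1=20.0, g2=-2.0):
    """Response function (13.11): E{Y} = g0 / (1 + g1 * exp(g2 * X))."""
    return g0 / (1.0 + g1 * math.exp(g2 * x))

# Halfway point: the response equals g0 / 2 where g1 * exp(g2 * X) = 1.
x_half = math.log(20.0) / 2.0

for x in (0.0, 1.0, x_half, 2.0, 3.0, 10.0):
    print(f"X = {x:7.4f}   E{{Y}} = {logistic_response(x):8.5f}")
```

The response is $10/21$ at $X = 0$, equals 5 at the halfway point, and approaches but never exceeds the maximum growth value $\gamma_0 = 10$.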

Logistic regression model (13.10) is also widely used when the response variable is 
qualitative. An example of this use of the logistic regression model is predicting whether 
a household will purchase a new car this year (will, will not) on the basis of the predictor 
variables age of presently owned car, household income, and size of household. In this 
use of logistic regression models, the response variable (will, will not purchase car, in our 
example) is qualitative and will be represented by a 0, 1 indicator variable. Consequently, 
the error terms are not normally distributed here with constant variance. Logistic regression 
models and their use when the response variable is qualitative will be discussed in detail in 
Chapter 14.

General Form of Nonlinear Regression Models. As we have seen from the two examples
of nonlinear regression models, these models are similar in general form to linear regression
models. Each $Y_i$ observation is postulated to be the sum of a mean response $f(\mathbf{X}_i, \boldsymbol{\gamma})$ based
on the given nonlinear response function and a random error term $\varepsilon_i$. Furthermore, the
error terms $\varepsilon_i$ are often assumed to be independent normal random variables with constant
variance.

An important difference of nonlinear regression models is that the number of regression
parameters is not necessarily directly related to the number of $X$ variables in the model.
In linear regression models, if there are $p - 1$ $X$ variables in the model, then there are
$p$ regression coefficients in the model. For the exponential regression model in (13.8), there
is one $X$ variable but three regression coefficients. The same is found for logistic regression
model (13.10). Hence, we now denote the number of $X$ variables in the nonlinear regression
model by $q$, but we continue to denote the number of regression parameters in the response
function by $p$. In the exponential regression model (13.6), for instance, there are $p = 2$
regression parameters and $q = 1$ $X$ variable.

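Also, we shall define the vector $\mathbf{X}_i$ of the observations on the $X$ variables without the
initial element 1. The general form of a nonlinear regression model is therefore expressed
as follows:

$$Y_i = f(\mathbf{X}_i, \boldsymbol{\gamma}) + \varepsilon_i \tag{13.12}$$

where:

$$\underset{q \times 1}{\mathbf{X}_i} = \begin{bmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{iq} \end{bmatrix} \qquad \underset{p \times 1}{\boldsymbol{\gamma}} = \begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \vdots \\ \gamma_{p-1} \end{bmatrix} \tag{13.12a}$$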




Comment 


Nonlinear response functions that can be linearized by a transformation are sometimes called
intrinsically linear response functions. For example, the exponential response function:

$$f(\mathbf{X}, \boldsymbol{\gamma}) = \gamma_0 [\exp(\gamma_1 X)]$$

is an intrinsically linear response function because it can be linearized by the logarithmic
transformation:

$$\log_e f(\mathbf{X}, \boldsymbol{\gamma}) = \log_e(\gamma_0) + \gamma_1 X$$

This transformed response function can be represented in the linear model form:

$$g(\mathbf{X}, \boldsymbol{\beta}) = \beta_0 + \beta_1 X$$

where $g(\mathbf{X}, \boldsymbol{\beta}) = \log_e f(\mathbf{X}, \boldsymbol{\gamma})$, $\beta_0 = \log_e \gamma_0$, and $\beta_1 = \gamma_1$.

Just because a nonlinear response function is intrinsically linear does not necessarily imply that
linear regression is appropriate. The reason is that the transformation to linearize the response function
will affect the error term in the model. For example, suppose that the following exponential regression
model with normal error terms that have constant variance is appropriate:

$$Y_i = \gamma_0 \exp(\gamma_1 X_i) + \varepsilon_i$$

A logarithmic transformation of $Y$ to linearize the response function will affect the normal error term $\varepsilon_i$
so that the error term in the linearized model will no longer be normal with constant variance. Hence,
it is important to study any nonlinear regression model that has been linearized for appropriateness;
it may turn out that the nonlinear regression model is preferable to the linearized version. ■
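To make the effect on the error term concrete, the worked step below takes logarithms of the exponential model with additive error. It is a sketch that assumes $\gamma_0 > 0$ and every $Y_i > 0$, so that the logarithms are defined.

```latex
\begin{aligned}
\log_e Y_i &= \log_e\!\left[\gamma_0 \exp(\gamma_1 X_i) + \varepsilon_i\right] \\
           &= \log_e \gamma_0 + \gamma_1 X_i
              + \log_e\!\left[1 + \frac{\varepsilon_i}{\gamma_0 \exp(\gamma_1 X_i)}\right]
\end{aligned}
```

The last term plays the role of the error term in the linearized model. It is a nonlinear function of $\varepsilon_i$, and its spread depends on the mean response $\gamma_0 \exp(\gamma_1 X_i)$, so it is neither normal nor of constant variance even when $\varepsilon_i$ is.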


Estimation of Regression Parameters 

Estimation of the parameters of a nonlinear regression model is usually carried out by the
method of least squares or the method of maximum likelihood, just as for linear regression
models. Also as in linear regression, both of these methods of estimation yield the
same parameter estimates when the error terms in nonlinear regression model (13.12) are
independent normal with constant variance.

Unlike linear regression, it is usually not possible to find analytical expressions for
the least squares and maximum likelihood estimators for nonlinear regression models.
Instead, numerical search procedures must be used with both of these estimation procedures,
requiring intensive computations. The analysis of nonlinear regression models is therefore
usually carried out by utilizing standard computer software programs.


Example 


To illustrate the fitting and analysis of nonlinear regression models in a simple fashion, 
we shall use an example where the model has only two parameters and the sample size 
is reasonably small. In so doing, we shall be able to explain the concepts and procedures 
without overwhelming the reader with details. 

A hospital administrator wished to develop a regression model for predicting the degree
of long-term recovery after discharge from the hospital for severely injured patients.
The predictor variable to be utilized is number of days of hospitalization ($X$), and the
response variable is a prognostic index for long-term recovery ($Y$), with large values of
the index reflecting a good prognosis. Data for 15 patients were studied and are presented
in Table 13.1. A scatter plot of the data is shown in Figure 13.2. Related earlier studies
reported in the literature found the relationship between the predictor variable and the
response variable to be exponential. Hence, it was decided to investigate the appropriateness
of the two-parameter nonlinear exponential regression model (13.6):

$$Y_i = \gamma_0 \exp(\gamma_1 X_i) + \varepsilon_i \tag{13.13}$$

where the $\varepsilon_i$ are independent normal with constant variance. If this model is appropriate, it
is desired to estimate the regression parameters $\gamma_0$ and $\gamma_1$.

FIGURE 13.2 Scatter Plot and Fitted Nonlinear Regression Function: Severely Injured Patients
Example. The plot shows Prognostic Index (vertical axis) against Days Hospitalized (horizontal
axis), with the fitted function $\hat{Y} = 58.6065 \exp(-.03959X)$.

13.2 Least Squares Estimation in Nonlinear Regression 


We noted in Chapter 1 that the method of least squares for simple linear regression requires 
the minimization of the criterion Q in (1.8): 

$$Q = \sum_{i=1}^{n} [Y_i - (\beta_0 + \beta_1 X_i)]^2 \tag{13.14}$$


TABLE 13.1 Data: Severely Injured Patients Example.

Patient    Days Hospitalized    Prognostic Index
   i            $X_i$                $Y_i$
   1              2                   54
   2              5                   50
   3              7                   45
   4             10                   37
   5             14                   35
   6             19                   25
   7             26                   20
   8             31                   16
   9             34                   18
  10             38                   13
  11             45                    8
  12             52                   11
  13             53                    8
  14             60                    4
  15             65                    6


Those values of $\beta_0$ and $\beta_1$ that minimize $Q$ for the given sample observations $(X_i, Y_i)$ are
the least squares estimates and are denoted by $b_0$ and $b_1$.

We also noted in Chapter 1 that one method for finding the least squares estimates is
by use of a numerical search procedure. With this approach, $Q$ in (13.14) is evaluated for
different values of $\beta_0$ and $\beta_1$, varying $\beta_0$ and $\beta_1$ systematically until the minimum value of $Q$
is found. The values of $\beta_0$ and $\beta_1$ that minimize $Q$ are the least squares estimates $b_0$ and $b_1$.

A second method for finding the least squares estimates is by means of the least squares
normal equations. Here, the least squares normal equations are found analytically by
differentiating $Q$ with respect to $\beta_0$ and $\beta_1$ and setting the derivatives equal to zero. The solution
of the normal equations yields the least squares estimates.

As we saw in Chapter 6, these procedures extend directly to multiple linear regression, for
which the least squares criterion is given in (6.22). The concepts of least squares estimation
for linear regression also extend directly to nonlinear regression models. The least squares
criterion again is:

$$Q = \sum_{i=1}^{n} [Y_i - f(\mathbf{X}_i, \boldsymbol{\gamma})]^2 \tag{13.15}$$

where $f(\mathbf{X}_i, \boldsymbol{\gamma})$ is the mean response for the $i$th case according to the nonlinear response
function $f(\mathbf{X}, \boldsymbol{\gamma})$. The least squares criterion $Q$ in (13.15) must be minimized with respect
to the nonlinear regression parameters $\gamma_0, \gamma_1, \ldots, \gamma_{p-1}$ to obtain the least squares estimates.
The same two methods for finding the least squares estimates, numerical search and normal
equations, may be used in nonlinear regression. A difference from linear regression is that
the solution of the normal equations usually requires an iterative numerical search procedure
because analytical solutions generally cannot be found.

Example


The response function in the severely injured patients example is seen from (13.13) to be:

$$f(\mathbf{X}, \boldsymbol{\gamma}) = \gamma_0 \exp(\gamma_1 X)$$

Hence, the least squares criterion $Q$ here is:

$$Q = \sum_{i=1}^{n} [Y_i - \gamma_0 \exp(\gamma_1 X_i)]^2$$

We can see that the method of maximum likelihood leads to the same criterion here
when the error terms $\varepsilon_i$ are independent normal with constant variance by considering the
likelihood function:

$$L(\boldsymbol{\gamma}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} [Y_i - \gamma_0 \exp(\gamma_1 X_i)]^2\right]$$

Just as for linear regression, maximizing this likelihood function with respect to the regression
parameters $\gamma_0$ and $\gamma_1$ is equivalent to minimizing the sum in the exponent, so that the
maximum likelihood estimates are the same here as the least squares estimates.
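The criterion $Q$ and its link to the likelihood can be sketched in a few lines of Python. The data lists are the 15 observations of Table 13.1, and the two parameter pairs are the starting and final estimates reported later in this section. The function names, and the value $\sigma^2 = 4$ used to compare likelihoods, are arbitrary illustrative choices.

```python
import math

# Table 13.1: days hospitalized (X) and prognostic index (Y) for 15 patients.
X = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
Y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

def sse(g0, g1):
    """Least squares criterion Q for f(X, gamma) = g0 * exp(g1 * X)."""
    return sum((y - g0 * math.exp(g1 * x)) ** 2 for x, y in zip(X, Y))

def log_likelihood(g0, g1, sigma2):
    """Normal-error log-likelihood; it depends on (g0, g1) only through Q."""
    n = len(Y)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sse(g0, g1) / (2.0 * sigma2)

q_start = sse(56.6646, -0.03797)   # starting values used later in the example
q_final = sse(58.6065, -0.03959)   # final estimates reported in Table 13.3
print(round(q_start, 4), round(q_final, 4))

# For any fixed sigma2, the pair with the smaller Q has the larger likelihood.
print(log_likelihood(58.6065, -0.03959, 4.0) > log_likelihood(56.6646, -0.03797, 4.0))
```

Because the log-likelihood is a decreasing function of $Q$ for fixed $\sigma^2$, ranking parameter values by $Q$ and by likelihood gives the same ordering.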

We now discuss how to obtain the least squares estimates, first by use of the normal 
equations and then by direct numerical search procedures. 





Solution of Normal Equations

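To obtain the normal equations for a nonlinear regression model:

$$Y_i = f(\mathbf{X}_i, \boldsymbol{\gamma}) + \varepsilon_i$$

we need to minimize the least squares criterion $Q$:

$$Q = \sum_{i=1}^{n} [Y_i - f(\mathbf{X}_i, \boldsymbol{\gamma})]^2$$

with respect to $\gamma_0, \gamma_1, \ldots, \gamma_{p-1}$. The partial derivative of $Q$ with respect to $\gamma_k$ is:

$$\frac{\partial Q}{\partial \gamma_k} = \sum_{i=1}^{n} -2[Y_i - f(\mathbf{X}_i, \boldsymbol{\gamma})] \left[\frac{\partial f(\mathbf{X}_i, \boldsymbol{\gamma})}{\partial \gamma_k}\right] \tag{13.16}$$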


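When the $p$ partial derivatives are each set equal to 0 and the parameters $\gamma_k$ are replaced by
the least squares estimates $g_k$, we obtain after some simplification the $p$ normal equations:

$$\sum_{i=1}^{n} Y_i \left[\frac{\partial f(\mathbf{X}_i, \boldsymbol{\gamma})}{\partial \gamma_k}\right]_{\boldsymbol{\gamma}=\mathbf{g}} - \sum_{i=1}^{n} f(\mathbf{X}_i, \mathbf{g}) \left[\frac{\partial f(\mathbf{X}_i, \boldsymbol{\gamma})}{\partial \gamma_k}\right]_{\boldsymbol{\gamma}=\mathbf{g}} = 0 \qquad k = 0, 1, \ldots, p - 1 \tag{13.17}$$

where $\mathbf{g}$ is the vector of the least squares estimates $g_k$:

$$\underset{p \times 1}{\mathbf{g}} = \begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_{p-1} \end{bmatrix} \tag{13.18}$$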


Note that the terms in brackets in (13.17) are the partial derivatives in (13.16) with the
parameters $\gamma_k$ replaced by the least squares estimates $g_k$.

The normal equations (13.17) for nonlinear regression models are nonlinear in the parameter
estimates $g_k$ and are usually difficult to solve, even in the simplest of cases. Hence,
numerical search procedures are ordinarily required to obtain a solution of the normal equations
iteratively. To make things still more difficult, multiple solutions may be possible.


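Example


In the severely injured patients example, the mean response for the $i$th case is:

$$f(\mathbf{X}_i, \boldsymbol{\gamma}) = \gamma_0 \exp(\gamma_1 X_i) \tag{13.19}$$

Hence, the partial derivatives of $f(\mathbf{X}_i, \boldsymbol{\gamma})$ are:

$$\frac{\partial f(\mathbf{X}_i, \boldsymbol{\gamma})}{\partial \gamma_0} = \exp(\gamma_1 X_i) \tag{13.20a}$$

$$\frac{\partial f(\mathbf{X}_i, \boldsymbol{\gamma})}{\partial \gamma_1} = \gamma_0 X_i \exp(\gamma_1 X_i) \tag{13.20b}$$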





Replacing $\gamma_0$ and $\gamma_1$ in (13.19), (13.20a), and (13.20b) by the respective least squares
estimates $g_0$ and $g_1$, the normal equations (13.17) therefore are:

$$\sum Y_i \exp(g_1 X_i) - \sum g_0 \exp(g_1 X_i) \exp(g_1 X_i) = 0$$

$$\sum Y_i g_0 X_i \exp(g_1 X_i) - \sum g_0 \exp(g_1 X_i)\, g_0 X_i \exp(g_1 X_i) = 0$$

Upon simplification, the normal equations become:

$$\sum Y_i \exp(g_1 X_i) - g_0 \sum \exp(2 g_1 X_i) = 0$$

$$\sum Y_i X_i \exp(g_1 X_i) - g_0 \sum X_i \exp(2 g_1 X_i) = 0$$

These normal equations are not linear in $g_0$ and $g_1$, and no closed-form solution exists.
Thus, numerical methods will be required to find the solution for the least squares estimates
iteratively.
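One simple way to solve these two equations numerically, offered here only as an illustration and not as the procedure used in the text, is to solve the first simplified equation for $g_0$ at a fixed $g_1$, substitute into the second, and locate the root in $g_1$ by bisection. In the Python sketch below, the data are those of Table 13.1; the function names and the search bracket $[-0.10, -0.01]$ are assumptions made for this example.

```python
import math

# Table 13.1: days hospitalized (X) and prognostic index (Y) for 15 patients.
X = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
Y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

def g0_given_g1(g1):
    """First simplified normal equation solved for g0 at a fixed g1."""
    num = sum(y * math.exp(g1 * x) for x, y in zip(X, Y))
    den = sum(math.exp(2.0 * g1 * x) for x in X)
    return num / den

def second_equation(g1):
    """Left side of the second simplified normal equation with g0 = g0(g1)."""
    g0 = g0_given_g1(g1)
    return (sum(y * x * math.exp(g1 * x) for x, y in zip(X, Y))
            - g0 * sum(x * math.exp(2.0 * g1 * x) for x in X))

def bisect(func, lo, hi, tol=1e-10):
    """Plain bisection; requires a sign change of func on [lo, hi]."""
    f_lo = func(lo)
    if f_lo * func(hi) > 0:
        raise ValueError("no sign change on the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

g1_hat = bisect(second_equation, -0.10, -0.01)  # assumed search range for g1
g0_hat = g0_given_g1(g1_hat)
print(round(g0_hat, 4), round(g1_hat, 5))
```

This reduction to a one-dimensional search works only because $g_0$ enters the response function linearly; with more parameters or several solutions, a general search method such as the Gauss-Newton method is needed.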


Direct Numerical Search — Gauss-Newton Method 

In many nonlinear regression problems, it is more practical to find the least squares estimates
by direct numerical search procedures rather than by first obtaining the normal equations
and then using numerical methods to find the solution for these equations iteratively. The
major statistical computer packages employ one or more direct numerical search procedures
for solving nonlinear regression problems. We now explain one of these direct numerical
search methods.

The Gauss-Newton method, also called the linearization method, uses a Taylor series
expansion to approximate the nonlinear regression model with linear terms and then employs
ordinary least squares to estimate the parameters. Iteration of these steps generally leads to
a solution to the nonlinear regression problem.

The Gauss-Newton method begins with initial or starting values for the regression
parameters $\gamma_0, \gamma_1, \ldots, \gamma_{p-1}$. We denote these by $g_0^{(0)}, g_1^{(0)}, \ldots, g_{p-1}^{(0)}$, where the superscript
in parentheses denotes the iteration number. The starting values $g_k^{(0)}$ may be obtained from
previous or related studies, theoretical expectations, or a preliminary search for parameter
values that lead to a comparatively low criterion value $Q$ in (13.15). We shall later discuss
in more detail the choice of the starting values.

Once the starting values for the parameters have been obtained, we approximate the
mean responses $f(\mathbf{X}_i, \boldsymbol{\gamma})$ for the $n$ cases by the linear terms in the Taylor series expansion
around the starting values $g_k^{(0)}$. We obtain for the $i$th case:

$$f(\mathbf{X}_i, \boldsymbol{\gamma}) \approx f(\mathbf{X}_i, \mathbf{g}^{(0)}) + \sum_{k=0}^{p-1} \left[\frac{\partial f(\mathbf{X}_i, \boldsymbol{\gamma})}{\partial \gamma_k}\right]_{\boldsymbol{\gamma}=\mathbf{g}^{(0)}} (\gamma_k - g_k^{(0)}) \tag{13.21}$$

where:

$$\underset{p \times 1}{\mathbf{g}^{(0)}} = \begin{bmatrix} g_0^{(0)} \\ g_1^{(0)} \\ \vdots \\ g_{p-1}^{(0)} \end{bmatrix} \tag{13.21a}$$
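For the two-parameter exponential response function of the severely injured patients example, the expansion (13.21) can be written out explicitly. The display below is a worked instance that assumes the partial derivatives (13.20a) and (13.20b) derived earlier:

```latex
f(X_i, \boldsymbol{\gamma}) \approx g_0^{(0)} \exp(g_1^{(0)} X_i)
  + \exp(g_1^{(0)} X_i)\,(\gamma_0 - g_0^{(0)})
  + g_0^{(0)} X_i \exp(g_1^{(0)} X_i)\,(\gamma_1 - g_1^{(0)})
```

The right side is linear in $\gamma_0$ and $\gamma_1$ even though the response function itself is not, which is what allows ordinary least squares to be applied at each iteration.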




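Note that $\mathbf{g}^{(0)}$ is the vector of the parameter starting values. The terms in brackets in (13.21)
are the same partial derivatives of the regression function we encountered earlier in the
normal equations (13.17), but here they are evaluated at $\gamma_k = g_k^{(0)}$ for $k = 0, 1, \ldots, p - 1$.
Let us now simplify the notation as follows:

$$f_i^{(0)} = f(\mathbf{X}_i, \mathbf{g}^{(0)}) \tag{13.22a}$$

$$\beta_k^{(0)} = \gamma_k - g_k^{(0)} \tag{13.22b}$$

$$D_{ik}^{(0)} = \left[\frac{\partial f(\mathbf{X}_i, \boldsymbol{\gamma})}{\partial \gamma_k}\right]_{\boldsymbol{\gamma}=\mathbf{g}^{(0)}}$$

In this notation, the Taylor approximation (13.21) for the mean response for the $i$th case is:

$$f(\mathbf{X}_i, \boldsymbol{\gamma}) \approx f_i^{(0)} + \sum_{k=0}^{p-1} D_{ik}^{(0)} \beta_k^{(0)}$$

and an approximation to the nonlinear regression model (13.12) is:

$$Y_i \approx f_i^{(0)} + \sum_{k=0}^{p-1} D_{ik}^{(0)} \beta_k^{(0)} + \varepsilon_i$$

When we shift the $f_i^{(0)}$ term to the left and denote the difference $Y_i - f_i^{(0)}$ by $Y_i^{(0)}$, we obtain
the following linear regression model approximation:

$$Y_i^{(0)} \approx \sum_{k=0}^{p-1} D_{ik}^{(0)} \beta_k^{(0)} + \varepsilon_i \qquad i = 1, \ldots, n \tag{13.24}$$

Note that the linear regression model approximation (13.24) is of the form:

$$Y_i = \beta_0 X_{i0} + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$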

The responses $Y_i^{(0)}$ in (13.24) are residuals, namely, the deviations of the observations
around the nonlinear regression function with the parameters replaced by the starting
estimates. The $X$ variable observations $D_{ik}^{(0)}$ are the partial derivatives of the mean response
evaluated for each of the $n$ cases with the parameters replaced by the starting estimates.
Each regression coefficient $\beta_k^{(0)}$ represents the difference between the true regression
parameter and the initial estimate of the parameter. Thus, the regression coefficients represent
the adjustment amounts by which the initial regression coefficients must be corrected. The
purpose of fitting the linear regression model approximation (13.24) is therefore to estimate
the regression coefficients $\beta_k^{(0)}$ and use these estimates to adjust the initial starting estimates
of the regression parameters. In fitting this linear regression approximation, note that there
is no intercept term in the model. Use of a computer multiple regression package therefore
requires a specification of no intercept.

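We shall represent the linear regression model approximation (13.24) in matrix form as
follows:

$$\mathbf{Y}^{(0)} \approx \mathbf{D}^{(0)} \boldsymbol{\beta}^{(0)} + \boldsymbol{\varepsilon} \tag{13.25}$$

where:

$$\underset{n \times 1}{\mathbf{Y}^{(0)}} = \begin{bmatrix} Y_1 - f_1^{(0)} \\ \vdots \\ Y_n - f_n^{(0)} \end{bmatrix} \tag{13.25a}$$

$$\underset{n \times p}{\mathbf{D}^{(0)}} = \begin{bmatrix} D_{10}^{(0)} & \cdots & D_{1,p-1}^{(0)} \\ \vdots & & \vdots \\ D_{n0}^{(0)} & \cdots & D_{n,p-1}^{(0)} \end{bmatrix} \tag{13.25b}$$

$$\underset{p \times 1}{\boldsymbol{\beta}^{(0)}} = \begin{bmatrix} \beta_0^{(0)} \\ \vdots \\ \beta_{p-1}^{(0)} \end{bmatrix} \tag{13.25c}$$

$$\underset{n \times 1}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix} \tag{13.25d}$$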

Note again that the approximation model (13.25) is precisely in the form of the general
linear regression model (6.19), with the $\mathbf{D}$ matrix of partial derivatives now playing the role
of the $\mathbf{X}$ matrix (but without a column of 1s for the intercept). We can therefore estimate
the parameters $\boldsymbol{\beta}^{(0)}$ by ordinary least squares and obtain according to (6.25):

$$\mathbf{b}^{(0)} = (\mathbf{D}^{(0)\prime} \mathbf{D}^{(0)})^{-1} \mathbf{D}^{(0)\prime} \mathbf{Y}^{(0)} \tag{13.26}$$

where $\mathbf{b}^{(0)}$ is the vector of the least squares estimated regression coefficients. As we noted
earlier, an ordinary multiple regression computer program can be used to obtain the estimated
regression coefficients $b_k^{(0)}$, with a specification of no intercept.

We then use these least squares estimates to obtain revised estimated regression
coefficients $g_k^{(1)}$ by means of (13.22b):

$$g_k^{(1)} = g_k^{(0)} + b_k^{(0)}$$

where $g_k^{(1)}$ denotes the revised estimate of $\gamma_k$ at the end of the first iteration. In matrix form,
we represent the revision process as follows:

$$\mathbf{g}^{(1)} = \mathbf{g}^{(0)} + \mathbf{b}^{(0)} \tag{13.27}$$

At this point, we can examine whether the revised regression coefficients represent
adjustments in the proper direction. We shall denote the least squares criterion measure $Q$
in (13.15) evaluated for the starting regression coefficients $\mathbf{g}^{(0)}$ by $SSE^{(0)}$; it is:

$$SSE^{(0)} = \sum_{i=1}^{n} [Y_i - f(\mathbf{X}_i, \mathbf{g}^{(0)})]^2 = \sum_{i=1}^{n} (Y_i - f_i^{(0)})^2 \tag{13.28}$$

At the end of the first iteration, the revised estimated regression coefficients are $\mathbf{g}^{(1)}$, and
the least squares criterion measure evaluated at this stage, now denoted by $SSE^{(1)}$, is:

$$SSE^{(1)} = \sum_{i=1}^{n} [Y_i - f(\mathbf{X}_i, \mathbf{g}^{(1)})]^2 = \sum_{i=1}^{n} (Y_i - f_i^{(1)})^2 \tag{13.29}$$




If the Gauss-Newton method is working effectively in the first iteration, $SSE^{(1)}$ should be
smaller than $SSE^{(0)}$ since the revised estimated regression coefficients $\mathbf{g}^{(1)}$ should be better
estimates.

Note that the nonlinear regression functions $f(\mathbf{X}_i, \mathbf{g}^{(0)})$ and $f(\mathbf{X}_i, \mathbf{g}^{(1)})$ are used in
calculating $SSE^{(0)}$ and $SSE^{(1)}$, and not the linear approximations from the Taylor series
expansion.

The revised regression coefficients $\mathbf{g}^{(1)}$ are not, of course, the least squares estimates for
the nonlinear regression problem because the fitted model (13.25) is only an approximation
of the nonlinear model. The Gauss-Newton method therefore repeats the procedure just
described, with $\mathbf{g}^{(1)}$ now used for the new starting values. This produces a new set of
revised estimates, denoted by $\mathbf{g}^{(2)}$, and a new least squares criterion measure $SSE^{(2)}$. The
iterative process is continued until the differences between successive coefficient estimates
$\mathbf{g}^{(s+1)} - \mathbf{g}^{(s)}$ and/or the difference between successive least squares criterion measures
$SSE^{(s+1)} - SSE^{(s)}$ become negligible. We shall denote the final estimates of the regression
coefficients simply by $\mathbf{g}$ and the final least squares criterion measure, which is the error sum
of squares, by $SSE$.

The Gauss-Newton method works effectively in many nonlinear regression applications. 
In some instances, however, the method may require numerous iterations before converging, 
and in a few cases it may not converge at all. 

Example


In the severely injured patients example, the initial values of the parameters $\gamma_0$ and $\gamma_1$
were obtained by noting that a logarithmic transformation of the response function
linearizes it:

$$\log_e \gamma_0 [\exp(\gamma_1 X)] = \log_e \gamma_0 + \gamma_1 X$$

Hence, a linear regression model with a transformed $Y$ variable was fitted as an initial
approximation to the exponential model:

$$Y_i' = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where:

$$Y_i' = \log_e Y_i$$

$$\beta_0 = \log_e \gamma_0$$

$$\beta_1 = \gamma_1$$

This linear regression model was fitted by ordinary least squares and yielded the estimated
regression coefficients $b_0 = 4.0371$ and $b_1 = -.03797$ (calculations not shown). Hence,
the initial starting values are $g_0^{(0)} = \exp(b_0) = \exp(4.0371) = 56.6646$ and
$g_1^{(0)} = b_1 = -.03797$.

The least squares criterion measure at this stage requires evaluation of the nonlinear
regression function (13.7) for each case, utilizing the starting parameter values $g_0^{(0)}$ and
$g_1^{(0)}$. For instance, for the first case, for which $X_1 = 2$, we obtain:

$$f(\mathbf{X}_1, \mathbf{g}^{(0)}) = f_1^{(0)} = g_0^{(0)} \exp(g_1^{(0)} X_1) = (56.6646) \exp[-.03797(2)] = 52.5208$$
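The starting values can be reproduced with a short calculation. The Python sketch below regresses $\log_e Y$ on $X$ by ordinary least squares for the Table 13.1 data and then back-transforms the intercept. The helper name `simple_ols` is a hypothetical choice, and the last digits may differ slightly from the text because the text rounds intermediate results.

```python
import math

# Table 13.1: days hospitalized (X) and prognostic index (Y) for 15 patients.
X = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
Y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

def simple_ols(x, y):
    """Intercept and slope of the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

log_y = [math.log(y) for y in Y]            # Y' = log_e Y
b0, b1 = simple_ols(X, log_y)               # linearized model coefficients
g0_start, g1_start = math.exp(b0), b1       # back-transform the intercept only
f1_start = g0_start * math.exp(g1_start * X[0])   # mean response for case 1

print(round(b0, 4), round(b1, 5), round(g0_start, 4), round(f1_start, 4))
```

These values serve only as a starting point: they minimize the sum of squares on the logarithmic scale, not the criterion $Q$ in (13.15).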



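TABLE 13.2 $\mathbf{Y}^{(0)}$ and $\mathbf{D}^{(0)}$ Matrices: Severely Injured Patients Example.

Symbolic form:

$$\underset{15 \times 1}{\mathbf{Y}^{(0)}} = \begin{bmatrix} Y_1 - g_0^{(0)} \exp(g_1^{(0)} X_1) \\ \vdots \\ Y_{15} - g_0^{(0)} \exp(g_1^{(0)} X_{15}) \end{bmatrix} \qquad \underset{15 \times 2}{\mathbf{D}^{(0)}} = \begin{bmatrix} \exp(g_1^{(0)} X_1) & g_0^{(0)} X_1 \exp(g_1^{(0)} X_1) \\ \vdots & \vdots \\ \exp(g_1^{(0)} X_{15}) & g_0^{(0)} X_{15} \exp(g_1^{(0)} X_{15}) \end{bmatrix}$$

Numerical values:

  i     $Y_i^{(0)}$    $D_{i0}^{(0)}$    $D_{i1}^{(0)}$
  1      1.4792       .92687       105.0416
  2      3.1337       .82708       234.33
  3      1.5609       .76660       304.0736
  4     -1.7624       .68407       387.6236
  5      1.6996       .58768       466.2057
  6     -2.5422       .48606       523.3020
  7     -1.1139       .37261       548.9603
  8     -1.4629       .30818       541.3505
  9      2.4172       .27500       529.8162
 10      -.3871       .23625       508.7088
 11     -2.2625       .18111       461.8140
 12      3.1327       .13884       409.0975
 13       .4259       .13367       401.4294
 14     -1.8063       .10247       348.3801
 15      1.1977       .08475       312.1510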


Since $Y_1 = 54$, the deviation from the mean response is:

$$Y_1^{(0)} = Y_1 - f_1^{(0)} = 54 - 52.5208 = 1.4792$$

Note again that the deviation $Y_1^{(0)}$ is the residual for case 1 at the initial fitting stage
since $f_1^{(0)}$ is the estimated mean response when the initial estimates $\mathbf{g}^{(0)}$ of the parameters
are employed. The stage 0 residuals for this and the other sample cases are presented in
Table 13.2 and constitute the $\mathbf{Y}^{(0)}$ vector.

The least squares criterion measure at this initial stage then is simply the sum of the
squared stage 0 residuals:

$$SSE^{(0)} = \sum (Y_i - f_i^{(0)})^2 = \sum (Y_i^{(0)})^2 = (1.4792)^2 + \cdots + (1.1977)^2 = 56.0869$$



To revise the initial estimates for the parameters, we require the $\mathbf{D}^{(0)}$ matrix and the
$\mathbf{Y}^{(0)}$ vector. The latter was already obtained in the process of calculating the least squares
criterion measure at stage 0. To obtain the $\mathbf{D}^{(0)}$ matrix, we need the partial derivatives of the
regression function (13.19) evaluated at $\boldsymbol{\gamma} = \mathbf{g}^{(0)}$. The partial derivatives are given in (13.20).
Table 13.2 shows the $\mathbf{D}^{(0)}$ matrix entries in symbolic form and also the numerical values.
To illustrate the calculations for case 1, we know from Table 13.1 that $X_1 = 2$. Hence,
evaluating the partial derivatives at $\mathbf{g}^{(0)}$, we find:

$$D_{10}^{(0)} = \left[\frac{\partial f(\mathbf{X}_1, \boldsymbol{\gamma})}{\partial \gamma_0}\right]_{\boldsymbol{\gamma}=\mathbf{g}^{(0)}} = \exp(g_1^{(0)} X_1) = \exp[-.03797(2)] = .92687$$

$$D_{11}^{(0)} = \left[\frac{\partial f(\mathbf{X}_1, \boldsymbol{\gamma})}{\partial \gamma_1}\right]_{\boldsymbol{\gamma}=\mathbf{g}^{(0)}} = g_0^{(0)} X_1 \exp(g_1^{(0)} X_1) = 56.6646(2) \exp[-.03797(2)] = 105.0416$$

We are now ready to obtain the least squares estimates $\mathbf{b}^{(0)}$ by regressing the response
variable $Y^{(0)}$ in Table 13.2 on the two $X$ variables in $\mathbf{D}^{(0)}$ in Table 13.2, using regression
with no intercept. A standard multiple regression computer program yielded $b_0^{(0)} = 1.8932$
and $b_1^{(0)} = -.001563$. Hence, the vector $\mathbf{b}^{(0)}$ of the estimated regression coefficients is:

$$\mathbf{b}^{(0)} = \begin{bmatrix} 1.8932 \\ -.001563 \end{bmatrix}$$

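By (13.27), we now obtain the revised least squares estimates $\mathbf{g}^{(1)}$:

$$\mathbf{g}^{(1)} = \mathbf{g}^{(0)} + \mathbf{b}^{(0)} = \begin{bmatrix} 56.6646 \\ -.03797 \end{bmatrix} + \begin{bmatrix} 1.8932 \\ -.001563 \end{bmatrix} = \begin{bmatrix} 58.5578 \\ -.03953 \end{bmatrix}$$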

Hence, $g_0^{(1)} = 58.5578$ and $g_1^{(1)} = -.03953$ are the revised parameter estimates at the
end of the first iteration. Note that the estimated regression coefficients have been revised
moderately from the initial values, as can be seen from Table 13.3a, which presents the
estimated regression coefficients and the least squares criterion measures for the starting
values and the first iteration. Note also that the least squares criterion measure has been
reduced in the first iteration.

Iteration 2 requires that we now revise the residuals from the exponential regression
function and the first partial derivatives, based on the revised parameter estimates $g_0^{(1)} = 58.5578$
and $g_1^{(1)} = -.03953$. For case 1, for which $Y_1 = 54$ and $X_1 = 2$, we obtain:

$$Y_1^{(1)} = Y_1 - f_1^{(1)} = 54 - (58.5578) \exp[-.03953(2)] = -.1065$$

$$D_{10}^{(1)} = \exp(g_1^{(1)} X_1) = \exp[-.03953(2)] = .92398$$

$$D_{11}^{(1)} = g_0^{(1)} X_1 \exp(g_1^{(1)} X_1) = 58.5578(2) \exp[-.03953(2)] = 108.2130$$

By comparing these results with the comparable stage 0 results for case 1 in Table 13.2,
we see that the absolute magnitude of the residual for case 1 is substantially reduced as a
result of the stage 1 revised fit and that the two partial derivatives are changed to a moderate
extent. After the revised residuals $Y_i^{(1)}$ and the partial derivatives $D_{i0}^{(1)}$ and $D_{i1}^{(1)}$ have been
obtained for all cases, the revised residuals are regressed on the revised partial derivatives,
using a no-intercept regression fit, and the estimated regression parameters are again revised
according to (13.27).

TABLE 13.3 Gauss-Newton Method Iterations and Final Nonlinear Least Squares Estimates:
Severely Injured Patients Example.

(a) Estimates of Parameters and Least Squares Criterion Measure

Iteration      $g_0$         $g_1$         $SSE$
    0        56.6646      -.03797      56.0869
    1        58.5578      -.03953      49.4638
    2        58.6065      -.03959      49.4593
    3        58.6065      -.03959      49.4593

(b) Final Least Squares Estimates

  k      $g_k$        $s\{g_k\}$
  0    58.6065       1.472
  1    -.03959       .00171

$$MSE = \frac{49.4593}{13} = 3.80456$$

(c) Estimated Approximate Variance-Covariance Matrix of Estimated Regression Coefficients

$$\mathbf{s}^2\{\mathbf{g}\} = MSE(\mathbf{D}'\mathbf{D})^{-1} = 3.80456 \begin{bmatrix} 5.696\text{E-}1 & -4.682\text{E-}4 \\ -4.682\text{E-}4 & 7.697\text{E-}7 \end{bmatrix} = \begin{bmatrix} 2.1672 & -1.781\text{E-}3 \\ -1.781\text{E-}3 & 2.928\text{E-}6 \end{bmatrix}$$

This process was carried out for three iterations. Table 13.3a contains the estimated
regression coefficients and the least squares criterion measure for each iteration. We see
that while iteration 1 led to moderate revisions in the estimated regression coefficients and
a substantially better fit according to the least squares criterion, iteration 2 resulted only in
minor revisions of the estimated regression coefficients and little improvement in the fit.
Iteration 3 led to no change in either the estimates of the coefficients or the least squares
criterion measure.

Hence, the search procedure was terminated after three iterations. The final regression
coefficient estimates therefore are $g_0 = 58.6065$ and $g_1 = -.03959$, and the fitted regression
function is:

$$\hat{Y} = (58.6065) \exp(-.03959X) \tag{13.30}$$

The error sum of squares for this fitted model is $SSE = 49.4593$. Figure 13.2
shows a plot of this estimated regression function, together with a scatter plot of the data.
The fit appears to be a good one.
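The whole iteration can be sketched as a short loop. The Python code below is a minimal illustration of the Gauss-Newton method for this two-parameter model with the Table 13.1 data, not a general-purpose implementation: it uses no safeguards such as step halving, and the stopping rule (change in $SSE$ below a tolerance `tol`) and the names used are assumptions made for this example.

```python
import math

# Table 13.1: days hospitalized (X) and prognostic index (Y) for 15 patients.
X = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
Y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

def sse(g0, g1):
    """Least squares criterion (13.15) for f(X, gamma) = g0 * exp(g1 * X)."""
    return sum((y - g0 * math.exp(g1 * x)) ** 2 for x, y in zip(X, Y))

def gauss_newton(g0, g1, max_iter=25, tol=1e-8):
    """Repeat g <- g + b, where b solves the 2x2 system (D'D) b = D'Y."""
    history = [(0, g0, g1, sse(g0, g1))]
    for s in range(1, max_iter + 1):
        d0 = [math.exp(g1 * x) for x in X]              # partial wrt gamma_0
        d1 = [g0 * x * e for x, e in zip(X, d0)]        # partial wrt gamma_1
        resid = [y - g0 * e for y, e in zip(Y, d0)]     # residuals at current g
        a00 = sum(u * u for u in d0)
        a01 = sum(u * v for u, v in zip(d0, d1))
        a11 = sum(v * v for v in d1)
        c0 = sum(u * r for u, r in zip(d0, resid))
        c1 = sum(v * r for v, r in zip(d1, resid))
        det = a00 * a11 - a01 * a01
        g0 += (a11 * c0 - a01 * c1) / det
        g1 += (a00 * c1 - a01 * c0) / det
        history.append((s, g0, g1, sse(g0, g1)))
        if abs(history[-2][3] - history[-1][3]) < tol:  # SSE stopped changing
            break
    return history

history = gauss_newton(56.6646, -0.03797)
for s, a, b, q in history:
    print(f"{s:2d}  {a:9.4f}  {b:9.5f}  {q:9.4f}")
```

Note that $SSE$ is always evaluated with the nonlinear response function itself, not with its linear approximation, in line with the remark made earlier about $SSE^{(0)}$ and $SSE^{(1)}$.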

Comments 

1. The choice of initial starting values is very important with the Gauss-Newton method because
a poor choice may result in slow convergence, convergence to a local minimum, or even divergence.
Good starting values will generally result in faster convergence, and if multiple minima exist, will
lead to a solution that is the global minimum rather than a local minimum. Fast convergence, even if
the initial estimates are far from the least squares solution, generally indicates that the linear
approximation model (13.25) is a good approximation to the nonlinear regression model. Slow convergence,
on the other hand, especially from initial estimates reasonably close to the least squares solution,
usually indicates that the linear approximation model is not a good approximation to the nonlinear
model.

2. A variety of methods are available for obtaining starting values for the regression parameters.
Often, related earlier studies can be utilized to provide good starting values for the regression
parameters. Another possibility is to select $p$ representative observations, set the regression function $f(\mathbf{X}_i, \boldsymbol{\gamma})$
equal to $Y_i$ for each of the $p$ observations (thereby ignoring the random error), solve the $p$ equations
for the $p$ parameters, and use the solutions as the starting values, provided they lead to reasonably
good fits of the observed data. Still another possibility is to do a grid search in the parameter space
by selecting in a grid fashion various trial choices of $\mathbf{g}$, evaluating the least squares criterion $Q$ for
each of these choices, and using as the starting values that $\mathbf{g}$ vector for which $Q$ is smallest.

3. When using the Gauss-Newton or another direct search procedure, it is often desirable to try
other sets of starting values after a solution has been obtained to make sure that the same solution will 
be found. 

4. Some computer packages for nonlinear regression require that the user specify the starting 
values for the regression parameters. Others do a grid search to obtain starting values. 

5. Most nonlinear computer programs have a library of commonly used regression functions. 
For nonlinear response functions not in the library and specified by the user, some computer programs
using the Gauss-Newton method require the user to input also the partial derivatives of the
regression function, while others numerically approximate partial derivatives from the regression 
function. 

6. The Gauss-Newton method may produce iterations that oscillate widely or result in increases 
in the error sum of squares. Sometimes, these aberrations are only temporary, but occasionally serious 
convergence problems exist. Various modifications of the Gauss-Newton method have been suggested 
to improve its performance, such as the Hartley modification (Ref. 13.1). 

7. Some properties that exist for linear regression least squares do not hold for nonlinear regression
least squares. For example, the residuals do not necessarily sum to zero for nonlinear least squares.
Additionally, the error sum of squares SSE and the regression sum of squares SSR do not necessarily
sum to the total sum of squares SSTO. Consequently, the coefficient of multiple determination
$R^2 = SSR/SSTO$ is not a meaningful descriptive statistic for nonlinear regression. ■
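
As an illustration of Comments 1 and 2, the following Python sketch runs a grid search for starting values and then plain Gauss-Newton iterations for the two-parameter exponential model $Y = \gamma_0 \exp(\gamma_1 X)$. It is only a sketch under stated assumptions: the data are hypothetical and generated without error from known coefficients, the function names are our own, and no safeguard such as the Hartley modification is included.

```python
import math

def q(g0, g1, xs, ys):
    """Least squares criterion Q for the model Y = g0 * exp(g1 * X)."""
    return sum((y - g0 * math.exp(g1 * x)) ** 2 for x, y in zip(xs, ys))

def grid_start(xs, ys, g0_grid, g1_grid):
    """Grid search: return the trial pair (g0, g1) with the smallest Q."""
    trials = [(a, b) for a in g0_grid for b in g1_grid]
    return min(trials, key=lambda g: q(g[0], g[1], xs, ys))

def gauss_newton(xs, ys, g0, g1, max_iter=50, tol=1e-10):
    """Plain Gauss-Newton iterations (no step-size safeguard)."""
    for _ in range(max_iter):
        d0 = [math.exp(g1 * x) for x in xs]          # partial derivative wrt g0
        d1 = [g0 * x * e for x, e in zip(xs, d0)]    # partial derivative wrt g1
        r = [y - g0 * e for y, e in zip(ys, d0)]     # current residuals
        # Solve the 2x2 normal equations (D'D) b = D'r directly.
        a00 = sum(u * u for u in d0)
        a01 = sum(u * v for u, v in zip(d0, d1))
        a11 = sum(v * v for v in d1)
        c0 = sum(u * e for u, e in zip(d0, r))
        c1 = sum(v * e for v, e in zip(d1, r))
        det = a00 * a11 - a01 * a01
        b0 = (a11 * c0 - a01 * c1) / det
        b1 = (a00 * c1 - a01 * c0) / det
        g0, g1 = g0 + b0, g1 + b1
        if abs(b0) < tol and abs(b1) < tol:
            break
    return g0, g1

# Hypothetical error-free data generated from g0 = 58.6065, g1 = -.03959.
xs = [2, 5, 10, 15, 20, 30, 40, 50, 60]
ys = [58.6065 * math.exp(-0.03959 * x) for x in xs]

start = grid_start(xs, ys, [40, 50, 60, 70], [-0.08, -0.06, -0.04, -0.02])
g0_hat, g1_hat = gauss_newton(xs, ys, *start)
print(start, g0_hat, g1_hat)
```

Because the hypothetical responses contain no error, the iterations should return the generating coefficients. With real data, rerunning `gauss_newton` from several other starting pairs is one way to follow the advice in Comment 3.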

Other Direct Search Procedures 

Two other direct search procedures, besides the Gauss-Newton method, that are frequently 
used are the method of steepest descent and the Marquardt algorithm. The method of 
steepest descent searches for the minimum least squares criterion measure Q by iteratively 
determining the direction in which the regression coefficients $\mathbf{g}$ should be changed. The
method of steepest descent is particularly effective when the starting values $\mathbf{g}^{(0)}$ are not
good, being far from the final values $\mathbf{g}$.

The Marquardt algorithm seeks to utilize the best features of the Gauss-Newton method 
and the method of steepest descent, and occupies a middle ground between these two
methods. 

Additional information about direct search procedures can be found in specialized 
sources, such as References 13.2 and 13.3. 



526 Part Three Nonlinear Regression


13.3 Model Building and Diagnostics



The model-building process for nonlinear regression models often differs somewhat from
that for linear regression models. The reason is that the functional form of many nonlinear
models is less suitable for adding or deleting predictor variables and curvature and interaction
effects in the direct fashion that is feasible for linear regression models. Some types of
nonlinear regression models do lend themselves to adding and deleting predictor variables
in a direct fashion. We shall take up two such nonlinear regression models in Chapter 14,
where we consider the logistic and Poisson multiple regression models.


Validation of the selected nonlinear regression model can be performed in the same 
fashion as for linear regression models. 

Use of diagnostic tools to examine the appropriateness of a fitted model plays an important
role in the process of building a nonlinear regression model. The appropriateness of a
regression model must always be considered, whether the model is linear or nonlinear. Nonlinear
regression models may not be appropriate for the same reasons as linear regression
models. For example, when nonlinear growth models are used for time series data, there
is the possibility that the error terms may be correlated. Also, unequal error variances are
often present when nonlinear growth models with asymptotes are fitted, such as exponential
models (13.6) and (13.8). Typically, the error variances for cases in the neighborhood of
the asymptote(s) differ from the error variances for cases elsewhere.

When replicate observations are available and the sample size is reasonably large, the appropriateness
of a nonlinear regression function can be tested formally by means of the lack
of fit test for linear regression models in (6.68). This test will be an approximate one for nonlinear
regression models, but the actual level of significance will be close to the specified level
when the sample size is reasonably large. Thus, we calculate the pure error sum of squares
by (3.16), obtain the lack of fit sum of squares by (3.24), and calculate test statistic (6.68b)
in the usual fashion when performing a formal lack of fit test for a nonlinear response
function.


Plots of residuals against time, against the fitted values, and against each of the predictor 
variables can be helpful in diagnosing departures from the assumed model, just as for 
linear regression models. In interpreting residual plots for nonlinear regression, one needs 
to remember that the residuals for nonlinear regression do not necessarily sum to zero. 

If unequal error variances are found to be present, weighted least squares can be used 
in fitting the nonlinear regression model. Alternatively, transformations of the response
variable can be investigated that may stabilize the variance of the error terms and also 
permit use of a linear regression model. 


Example 


In the severely injured patients example, the residuals were obtained by use of the fitted 
nonlinear regression function (13.30): 

$e_i = Y_i - (58.6065)\exp(-.03959X_i)$

A plot of the residuals against the fitted values is shown in Figure 13.3a, and a normal 
probability plot of the residuals is shown in Figure 13.3b. These plots do not suggest any 
serious departures from the model assumptions. The residual plot against the fitted values 
in Figure 13.3a does raise the question whether the error variance may be somewhat larger 
for cases with small fitted values near the asymptote. The Brown-Forsythe test (3.9) was
conducted. Its P-value is .64, indicating that the residuals are consistent with constancy of
the error variance.

FIGURE 13.3 Diagnostic Plots: Severely Injured Patients Example.
(a) Residual Plot against $\hat{Y}$ (residual versus fitted value); (b) Normal Probability Plot (residual versus expected value).

On the basis of these, as well as some other diagnostics, it was concluded that exponential 
regression model (13.13) is appropriate for the data. 


13.4 Inferences about Nonlinear Regression Parameters

Exact inference procedures about the regression parameters are available for linear regression
models with normal error terms for any sample size. Unfortunately, this is not the
case for nonlinear regression models with normal error terms, where the least squares and 
maximum likelihood estimators for any given sample size are not normally distributed, are 
not unbiased, and do not have minimum variance. 

Consequently, inferences about the regression parameters in nonlinear regression are 
usually based on large-sample theory. This theory tells us that the least squares and maximum 
likelihood estimators for nonlinear regression models with normal error terms, when the 
sample size is large, are approximately normally distributed and almost unbiased, and 
have almost minimum variance. This large-sample theory also applies when the error terms 
are not normally distributed. 

Before presenting details about large-sample inferences for nonlinear regression, we
need to consider first how the error term variance $\sigma^2$ is estimated for nonlinear regression
models.

Estimate of Error Term Variance 

Inferences about nonlinear regression parameters require an estimate of the error term
variance $\sigma^2$. This estimate is of the same form as for linear regression, the error sum of
squares again being the sum of the squared residuals:

$MSE = \frac{SSE}{n - p} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - p} = \frac{\sum [Y_i - f(\mathbf{X}_i, \mathbf{g})]^2}{n - p}$ (13.31)

Here $\mathbf{g}$ is the vector of the final parameter estimates, so that the residuals are the deviations
around the fitted nonlinear regression function using the final estimated regression coefficients
$\mathbf{g}$. For nonlinear regression, MSE is not an unbiased estimator of $\sigma^2$, but the bias is
small when the sample size is large.


Large-Sample Theory 

When the error terms are independent and normally distributed and the sample size is
reasonably large, the following theorem provides the basis for inferences for nonlinear
regression models:

When the error terms are independent $N(0, \sigma^2)$ and the sample size n
is reasonably large, the sampling distribution of $\mathbf{g}$ is approximately
normal. (13.32)

The expected value of the mean vector is approximately:

$E\{\mathbf{g}\} \approx \boldsymbol{\gamma}$ (13.32a)

The approximate variance-covariance matrix of the regression
coefficients is estimated by:

$s^2\{\mathbf{g}\} = MSE(\mathbf{D}'\mathbf{D})^{-1}$ (13.32b)


Here $\mathbf{D}$ is the matrix of partial derivatives evaluated at the final least squares estimates $\mathbf{g}$,
just as $\mathbf{D}^{(0)}$ in (13.25b) is the matrix of partial derivatives evaluated at $\mathbf{g}^{(0)}$. Note that the
estimated approximate variance-covariance matrix $s^2\{\mathbf{g}\}$ is of exactly the same form as the
one for linear regression in (6.48), with $\mathbf{D}$ again playing the role of the $\mathbf{X}$ matrix.


Thus, when the sample size is large and the error terms are independent normal with constant
variance, the least squares estimators in $\mathbf{g}$ for nonlinear regression are approximately
normally distributed and almost unbiased. They also have near minimum variance, since
the variance-covariance matrix in (13.32b) estimates the minimum variances. We should
add that theorem (13.32) holds even if the error terms are not normally distributed.

As a result of theorem (13.32), inferences for nonlinear regression parameters are carried
out in the same fashion as for linear regression when the sample size is reasonably large.
Thus, an interval estimate for a regression parameter is carried out by (6.50) and a test
by (6.51). The needed estimated variance is obtained from the matrix $s^2\{\mathbf{g}\}$ in (13.32b).
These inference procedures when applied to nonlinear regression are only approximate, to
be sure, but the approximation often is very good. For some nonlinear regression models,
the sample size can be quite small for the large-sample approximation to be good. For other
nonlinear regression models, however, the sample size may need to be quite large.
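
To make the large-sample variance estimate concrete, here is a minimal Python sketch that evaluates MSE and the matrix MSE(D'D)^{-1} for the exponential model $Y = \gamma_0 \exp(\gamma_1 X)$. The data and the coefficient values passed in are hypothetical and are simply treated as if they were the final least squares estimates; the function name `var_cov` is ours.

```python
import math

def var_cov(xs, ys, g0, g1):
    """Return (MSE, s2) where s2 = MSE * (D'D)^{-1} for Y = g0 * exp(g1 * X).

    D has one row per case and two columns: the partial derivatives of the
    response function with respect to g0 and g1, evaluated at (g0, g1).
    """
    n, p = len(xs), 2
    d0 = [math.exp(g1 * x) for x in xs]           # column of D for g0
    d1 = [g0 * x * e for x, e in zip(xs, d0)]     # column of D for g1
    sse = sum((y - g0 * e) ** 2 for y, e in zip(ys, d0))
    mse = sse / (n - p)
    a00 = sum(u * u for u in d0)
    a01 = sum(u * v for u, v in zip(d0, d1))
    a11 = sum(v * v for v in d1)
    det = a00 * a11 - a01 * a01
    inv = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]   # (D'D)^{-1}
    return mse, [[mse * v for v in row] for row in inv]

# Hypothetical data; (58.6, -0.0396) stands in for the final estimates.
xs = [2, 5, 10, 20, 40, 60]
ys = [54.0, 48.5, 39.0, 27.0, 12.0, 5.5]
mse, s2 = var_cov(xs, ys, 58.6, -0.0396)
se = [math.sqrt(s2[0][0]), math.sqrt(s2[1][1])]   # estimated standard deviations
print(mse, s2, se)
```

The square roots of the diagonal elements play the role of the estimated standard deviations used in the interval estimates and tests that follow.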


When Is Large-Sample Theory Applicable? 

Ideally, we would like a rule that would tell us when the sample size in any given nonlinear
regression application is large enough so that the large-sample inferences based on asymptotic
theorem (13.32) are appropriate. Unfortunately, no simple rule exists that tells us when
it is appropriate to use the large-sample inference methods and when it is not appropriate.
However, a number of guidelines have been developed that are helpful in assessing the
appropriateness of using the large-sample inference procedures in a given application.

1. Quick convergence of the iterative procedure in finding the estimates of the nonlinear
regression parameters is often an indication that the linear approximation in (13.25) to
the nonlinear regression model is a good approximation and hence that the asymptotic
properties of the regression estimates are applicable. Slow convergence suggests caution 
and consideration of other guidelines before large-sample inferences are employed. 

2. Several measures have been developed for providing guidance about the appropriateness
of the use of large-sample inference procedures. Bates and Watts (Ref. 13.4) developed
curvature measures of nonlinearity. These indicate the extent to which the nonlinear
regression function fitted to the data can be reasonably approximated by the linear approximation
in (13.25). Box (Ref. 13.5) obtained a formula for estimating the bias of the estimated
regression coefficients. A small bias supports the appropriateness of the large-sample
inference procedures. Hougaard (Ref. 13.6) developed an estimate of the skewness of the
sampling distributions of the estimated regression coefficients. An indication of little skewness
supports the approximate normality of the sampling distributions and consequently the
applicability of the large-sample inference procedures.

3. Bootstrap sampling described in Chapter 11 provides a direct means of examining 
whether the sampling distributions of the nonlinear regression parameter estimates are
approximately normal, whether the variances of the sampling distributions are near the
variances for the linear approximation model, and whether the bias in each of the parameter
estimates is fairly small. If so, the sampling behavior of the nonlinear regression estimates is
said to be close-to-linear and the large-sample inference procedures may appropriately be 
used. Nonlinear regression estimates whose sampling distributions are not close to normal, 
whose variances are much larger than the variances for the linear approximation model, 
and for which there is substantial bias are said to behave in a far-from-linear fashion and 
the large-sample inference procedures are then not appropriate. 

Once many bootstrap samples have been obtained and the nonlinear regression parameter
estimates calculated for each sample, the bootstrap sampling distribution for each parameter
estimate can be examined to see if it is near normal. The variances of the bootstrap
distributions of the estimated regression coefficients can be obtained next to see if they are
close to the large-sample variance estimates obtained by (13.32b). Similarly, the bootstrap
confidence intervals for the regression coefficients can be obtained and compared with the
large-sample confidence intervals. Good agreement between these intervals again provides
support for the appropriateness of the large-sample inference procedures. In addition, the 
difference between each final regression parameter estimate and the mean of its bootstrap 
sampling distribution is an estimate of the bias of the regression estimate. Small or negligible 
biases of the nonlinear regression estimates support the appropriateness of the large-sample 
inference procedures. 
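
The bootstrap check described in guideline 3 can be sketched in Python as follows. This is an illustration only: the data are hypothetical, fixed X sampling is used (residuals are resampled and added to the fitted values), the model is $Y = \gamma_0 \exp(\gamma_1 X)$, and the helper names `fit` and `fixed_x_bootstrap` are our own.

```python
import math
import random
import statistics

def fit(xs, ys, g0, g1, iters=25):
    """Gauss-Newton fit of Y = g0 * exp(g1 * X), started at (g0, g1)."""
    for _ in range(iters):
        d0 = [math.exp(g1 * x) for x in xs]
        d1 = [g0 * x * e for x, e in zip(xs, d0)]
        r = [y - g0 * e for y, e in zip(ys, d0)]
        a00 = sum(u * u for u in d0)
        a01 = sum(u * v for u, v in zip(d0, d1))
        a11 = sum(v * v for v in d1)
        c0 = sum(u * e for u, e in zip(d0, r))
        c1 = sum(v * e for v, e in zip(d1, r))
        det = a00 * a11 - a01 * a01
        g0 += (a11 * c0 - a01 * c1) / det
        g1 += (a00 * c1 - a01 * c0) / det
    return g0, g1

def fixed_x_bootstrap(xs, ys, g0, g1, n_boot=200, seed=1):
    """Resample residuals with replacement, add them to the fitted values, refit."""
    rng = random.Random(seed)
    fitted = [g0 * math.exp(g1 * x) for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    estimates = []
    for _ in range(n_boot):
        y_star = [f + rng.choice(resid) for f in fitted]
        estimates.append(fit(xs, y_star, g0, g1))
    return estimates

# Hypothetical data: an exponential curve plus small fixed disturbances.
xs = [2, 5, 10, 15, 20, 30, 40, 50, 60]
noise = [0.8, -1.1, 0.5, -0.4, 1.2, -0.9, 0.3, -0.6, 0.7]
ys = [58.6 * math.exp(-0.0396 * x) + e for x, e in zip(xs, noise)]

g0, g1 = fit(xs, ys, 60.0, -0.04)
boot = fixed_x_bootstrap(xs, ys, g0, g1)
g0_star = [b[0] for b in boot]
bias0 = statistics.mean(g0_star) - g0     # estimated bias of g0
sd0 = statistics.stdev(g0_star)           # bootstrap standard deviation of g0
print(g0, g1, bias0, sd0)
```

A histogram of `g0_star`, and the comparison of `sd0` with the large-sample standard deviation from (13.32b), would then be examined as described above. The number of bootstrap samples (200 here) is kept small only to keep the sketch quick.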

Remedial Measures. When the diagnostics suggest that large-sample inference procedures
are not appropriate in a particular instance, remedial measures should be explored.
One possibility is to reparameterize the nonlinear regression model. For example, studies
have shown that for the nonlinear model:

$Y_i = \gamma_0 X_i / (\gamma_1 + X_i) + \varepsilon_i$

the use of large-sample inference procedures is often not appropriate. However, the following
reparameterization:

$Y_i = X_i / (\theta_1 X_i + \theta_2) + \varepsilon_i$

where $\theta_1 = 1/\gamma_0$ and $\theta_2 = \gamma_1/\gamma_0$, yields identical fits and generally involves no problem
in using large-sample inference procedures for moderate sample sizes (see Ref. 13.7 for
details).
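
That the two parameterizations describe the same response function can be checked directly. Dividing the numerator and denominator of the original response function $\gamma_0 X_i/(\gamma_1 + X_i)$ by $\gamma_0$ (assumed nonzero) gives:

```latex
\frac{\gamma_0 X_i}{\gamma_1 + X_i}
  = \frac{X_i}{\dfrac{1}{\gamma_0}\,X_i + \dfrac{\gamma_1}{\gamma_0}}
  = \frac{X_i}{\theta_1 X_i + \theta_2},
\qquad
\theta_1 = \frac{1}{\gamma_0}, \quad \theta_2 = \frac{\gamma_1}{\gamma_0}
```

The fitted values are therefore identical; only the sampling behavior of the estimated parameters differs between the two forms.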

Another remedial measure is to use the bootstrap estimates of precision and confidence
intervals instead of the large-sample inferences. However, when the linear approximation
in (13.25) is not a close approximation to the nonlinear regression model, convergence may
be very slow and bootstrap estimates of precision and confidence intervals may be difficult to
obtain. Still another remedial measure that is sometimes available is to increase the sample
size.


Example 


For the severely injured patients example, we know from Table 13.3a on page 524 that
the final error sum of squares is SSE = 49.4593. Since p = 2 parameters are present in the
nonlinear response function (13.19), we obtain:

$MSE = \frac{SSE}{n - p} = \frac{49.4593}{13} = 3.80456$


Table 13.3b presents this mean square, and Table 13.3c contains the large-sample estimated
variance-covariance matrix of the estimated regression coefficients. The matrix $(\mathbf{D}'\mathbf{D})^{-1}$ is
based on the final regression coefficient estimates $\mathbf{g}$ and is shown without computational
details.

We see from Table 13.3c that $s^2\{g_0\} = 2.1672$ and $s^2\{g_1\} = .000002928$. The estimated
standard deviations of the regression coefficients are given in Table 13.3b.

To check on the appropriateness of the large-sample variances of the estimated regression
coefficients and on the applicability of large-sample inferences in general, we have generated
1,000 bootstrap samples of size 15. The fixed X sampling procedure was used since the
exponential model appears to fit the data well and the error term variance appears to be
fairly constant. Histograms of the resulting bootstrap sampling distributions of $g_0^*$ and $g_1^*$
are shown in Figure 13.4, together with some characteristics of these distributions. We see
that the $g_0^*$ distribution is close to normal. The $g_1^*$ distribution suggests that the sampling
distribution may be slightly skewed to the left, but the departure from normality does not
appear to be great. The means of the distributions, denoted by $\bar{g}_0^*$ and $\bar{g}_1^*$, are very close to
the final least squares estimates, indicating that the bias in the estimates is negligible:

$\bar{g}_0^* = 58.67$    $\bar{g}_1^* = -.03936$
$g_0 = 58.61$    $g_1 = -.03959$

Furthermore, the standard deviations of the bootstrap sampling distributions are very close
to the large-sample standard deviations in Table 13.3b:

$s^*\{g_0^*\} = 1.423$    $s^*\{g_1^*\} = .00142$
$s\{g_0\} = 1.472$    $s\{g_1\} = .00171$

These indications all point to the appropriateness of large-sample inferences here, even
though the sample size ($n = 15$) is not very large.



FIGURE 13.4 Bootstrap Sampling Distributions: Severely Injured Patients Example.
(a) Histogram of Bootstrap Estimates $g_0^*$: $\bar{g}_0^* = 58.67$, $s^*\{g_0^*\} = 1.423$, $g_0^*(.025) = 56.044$, $g_0^*(.975) = 61.436$
(b) Histogram of Bootstrap Estimates $g_1^*$: $\bar{g}_1^* = -.03936$, $s^*\{g_1^*\} = .00142$, $g_1^*(.025) = -.04207$, $g_1^*(.975) = -.03681$


Interval Estimation of a Single $\gamma_k$

Based on large-sample theorem (13.32), the following approximate result holds when the
sample size is large and the error terms are normally distributed:

$\frac{g_k - \gamma_k}{s\{g_k\}} \sim t(n - p)$    $k = 0, 1, \ldots, p - 1$ (13.33)

where $t(n - p)$ is a t variable with n - p degrees of freedom. Hence, approximate $1 - \alpha$
confidence limits for any single $\gamma_k$ are formed by means of (6.50):

$g_k \pm t(1 - \alpha/2; n - p)\,s\{g_k\}$ (13.34)

where $t(1 - \alpha/2; n - p)$ is the $(1 - \alpha/2)100$ percentile of the t distribution with n - p
degrees of freedom.

Example

For the severely injured patients example, it is desired to estimate $\gamma_1$ with a 95 percent
confidence interval. We require $t(.975; 13) = 2.160$, and find from Table 13.3b that $g_1 = -.03959$
and $s\{g_1\} = .00171$. Hence, the confidence limits are $-.03959 \pm 2.160(.00171)$,
and the approximate 95 percent confidence interval for $\gamma_1$ is:

$-.0433 \le \gamma_1 \le -.0359$

Thus, we can conclude with approximate 95 percent confidence that $\gamma_1$ is between -.0433
and -.0359. To confirm the appropriateness of this large-sample confidence interval, we
shall obtain the 95 percent bootstrap confidence interval for $\gamma_1$. Using (11.58) and the results
in Figure 13.4b, we obtain:

$d_1 = g_1 - g_1^*(.025) = -.03959 + .04207 = .00248$
$d_2 = g_1^*(.975) - g_1 = -.03681 + .03959 = .00278$

The reflection method confidence limits by (11.59) then are:

$g_1 - d_2 = -.03959 - .00278 = -.04237$
$g_1 + d_1 = -.03959 + .00248 = -.03711$

Hence, the 95 percent bootstrap confidence interval is $-.0424 \le \gamma_1 \le -.0371$. This confidence
interval is very close to the large-sample confidence interval, again supporting the
appropriateness of large-sample inference procedures here.
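
The arithmetic of both intervals can be retraced with a few lines of Python. The sketch uses only the figures quoted in the example; the variable names are ours, and the t percentile is entered as a table value rather than computed.

```python
# Large-sample interval for gamma_1, limits g1 +/- t(.975; 13) * s{g1}
g1, s_g1 = -0.03959, 0.00171        # estimate and standard deviation (Table 13.3b)
t_crit = 2.160                      # t(.975; 13), taken from the t table
large_sample = (g1 - t_crit * s_g1, g1 + t_crit * s_g1)

# Bootstrap reflection interval from the 2.5th and 97.5th bootstrap percentiles
q025, q975 = -0.04207, -0.03681     # g1*(.025) and g1*(.975), Figure 13.4b
d1 = g1 - q025                      # distance from the lower percentile up to g1
d2 = q975 - g1                      # distance from g1 up to the upper percentile
reflection = (g1 - d2, g1 + d1)     # the two distances are swapped ("reflected")

print([round(v, 4) for v in large_sample])
print([round(v, 5) for v in reflection])
```

Note that the reflection method subtracts the upper distance $d_2$ and adds the lower distance $d_1$, which is what distinguishes it from simply reporting the two bootstrap percentiles.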

Simultaneous Interval Estimation of Several $\gamma_k$

Approximate joint confidence intervals for several regression parameters in nonlinear regression
can be developed by the Bonferroni procedure. If m parameters are to be estimated
with approximate family confidence coefficient $1 - \alpha$, the joint Bonferroni confidence
limits are:

$g_k \pm B\,s\{g_k\}$ (13.35)

where:

$B = t(1 - \alpha/2m; n - p)$ (13.35a)

Example

In the severely injured patients example, it is desired to obtain simultaneous interval estimates
for $\gamma_0$ and $\gamma_1$ with an approximate 90 percent family confidence coefficient. With
the Bonferroni procedure we therefore require separate confidence intervals for the two
parameters, each with a 95 percent statement confidence coefficient. We have already obtained
a confidence interval for $\gamma_1$ with a 95 percent statement confidence coefficient. The
approximate 95 percent statement confidence limits for $\gamma_0$, using the results in Table 13.3b,
are $58.6065 \pm 2.160(1.472)$ and the confidence interval for $\gamma_0$ is:

$55.43 \le \gamma_0 \le 61.79$

Hence, the joint confidence intervals with approximate family confidence coefficient of
90 percent are:

$55.43 \le \gamma_0 \le 61.79$
$-.0433 \le \gamma_1 \le -.0359$
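
A short Python sketch of the Bonferroni calculation for this example follows. It assumes nothing beyond the quoted estimates; the dictionary keys and variable names are ours, and the multiple B is entered as the table value t(.975; 13) = 2.160 rather than computed.

```python
alpha, m = 0.10, 2                      # family level and number of parameters
level = 1 - alpha / (2 * m)             # percentile needed for B: 1 - alpha/2m
B = 2.160                               # t(.975; 13), taken from the t table

# (estimate, estimated standard deviation) for each parameter, Table 13.3b
estimates = {"gamma0": (58.6065, 1.472), "gamma1": (-0.03959, 0.00171)}
intervals = {k: (g - B * s, g + B * s) for k, (g, s) in estimates.items()}

print(level)
for name, (lo, hi) in intervals.items():
    print(name, lo, hi)
```

With m = 2 and a 90 percent family coefficient, the percentile required is .975, which is why each statement interval coincides with an ordinary 95 percent interval.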

Test Concerning a Single $\gamma_k$

A large-sample test concerning a single $\gamma_k$ is set up in the usual fashion. To test:

$H_0$: $\gamma_k = \gamma_{k0}$
$H_a$: $\gamma_k \ne \gamma_{k0}$ (13.36a)

where $\gamma_{k0}$ is the specified value of $\gamma_k$, we may use the $t^*$ test statistic based on (6.49) when
n is reasonably large:

$t^* = \frac{g_k - \gamma_{k0}}{s\{g_k\}}$ (13.36b)

The decision rule for controlling the risk of making a Type I error at approximately $\alpha$ then is:

If $|t^*| \le t(1 - \alpha/2; n - p)$, conclude $H_0$
If $|t^*| > t(1 - \alpha/2; n - p)$, conclude $H_a$ (13.36c)


Example

In the severely injured patients example, we wish to test:

$H_0$: $\gamma_0 = 54$
$H_a$: $\gamma_0 \ne 54$

The test statistic (13.36b) here is:

$t^* = \frac{58.6065 - 54}{1.472} = 3.13$

For $\alpha = .01$, we require $t(.995; 13) = 3.012$. Since $|t^*| = 3.13 > 3.012$, we conclude $H_a$,
that $\gamma_0 \ne 54$. The approximate two-sided P-value of the test is .008.
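
The test statistic and decision can be verified with the following Python lines, which use only the values quoted in the example (the variable names are ours, and the critical value is the table entry, not a computed percentile).

```python
g0, s_g0 = 58.6065, 1.472     # estimate and its estimated standard deviation
gamma_00 = 54                 # value specified under H0
t_crit = 3.012                # t(.995; 13), taken from the t table

t_star = (g0 - gamma_00) / s_g0          # test statistic (13.36b)
conclude_ha = abs(t_star) > t_crit       # decision rule (13.36c)
print(round(t_star, 2), conclude_ha)
```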


Test Concerning Several $\gamma_k$

When a large-sample test concerning several $\gamma_k$ simultaneously is desired, we use the same
approach as for the general linear test, first fitting the full model and obtaining SSE(F),
then fitting the reduced model and obtaining SSE(R), and finally calculating the same test
statistic (2.70) as for linear regression:

$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div MSE(F)$ (13.37)

For large n, this test statistic is distributed approximately as $F(df_R - df_F, df_F)$ when $H_0$
holds.


13.5 Learning Curve Example

We now present a second example, to provide an additional illustration of the nonlinear
regression concepts developed in this chapter. An electronics products manufacturer
undertook the production of a new product in two locations (location A: coded $X_1 = 1$,
location B: coded $X_1 = 0$). Location B has more modern facilities and hence was expected
to be more efficient than location A, even after the initial learning period. An industrial engineer
calculated the expected unit production cost for a modern facility after learning has
occurred. Weekly unit production costs for each location were then expressed as a fraction
of this expected cost. The reciprocal of this fraction is a measure of relative efficiency, and
this relative efficiency measure was utilized as the response variable (Y) in the study.

It is well known that efficiency increases over time when a new product is produced,
and that the improvements eventually slow down and the process stabilizes. Hence, it was
decided to employ an exponential model with an upper asymptote for expressing the relation
between relative efficiency (Y) and time ($X_2$), and to incorporate a constant effect for the
difference in the two production locations. The model decided on was:

$Y_i = \gamma_0 + \gamma_1 X_{i1} + \gamma_3 \exp(\gamma_2 X_{i2}) + \varepsilon_i$ (13.38)

When $\gamma_2$ and $\gamma_3$ are negative, $\gamma_0$ is the upper asymptote for location B as $X_2$ gets large and
$\gamma_0 + \gamma_1$ is the upper asymptote for location A. The parameters $\gamma_2$ and $\gamma_3$ reflect the speed
of learning, which was expected to be the same in the two locations.

FIGURE 13.5 Scatter Plot and Fitted Nonlinear Regression Functions: Learning Curve Example.
(Axes: relative efficiency versus time in weeks. Curve label shown: $\hat{Y} = 1.0156 - .5524\exp(-.1348X)$.)

While weekly data on relative production efficiency for each location were available, we
shall only use observations for selected weeks during the first 90 weeks of production to
simplify the presentation. A portion of the data on location, week, and relative efficiency is
presented in Table 13.4; a plot of the data is shown in Figure 13.5. Note that learning was
relatively rapid in both locations, and that the relative efficiency in location B toward the
end of the 90-week period even exceeded 1.0; i.e., the actual unit costs at this stage were
lower than the industrial engineer's expected unit cost.

TABLE 13.4 Data: Learning Curve Example.

Observation   Location   Week    Relative Efficiency
i             X_i1       X_i2    Y_i
 1            1           1       .483
 2            1           2       .539
 3            1           3       .618
...           ...        ...      ...
13            1          70       .960
14            1          80       .967
15            1          90       .975
16            0           1       .517
17            0           2       .598
18            0           3       .635
...           ...        ...      ...
28            0          70      1.028
29            0          80      1.017
30            0          90      1.023

Regression model (13.38) is nonlinear in the parameters $\gamma_2$ and $\gamma_3$. Hence, a direct
numerical search estimation procedure was to be employed, for which starting values for
the parameters are needed. These were developed partly from past experience, partly from
analysis of the data. Previous studies indicated that $\gamma_3$ should be in the neighborhood of -.5,
so $g_3^{(0)} = -.5$ was used as the starting value. Since the difference in the relative efficiencies
between locations A and B for a given week tended to average -.0459 during the 90-week
period, a starting value $g_1^{(0)} = -.0459$ was specified. The largest observed relative efficiency
for location B was 1.028, so that a starting value $g_0^{(0)} = 1.025$ was felt to be reasonable.
Only a starting value for $\gamma_2$ remains to be found. This was chosen by selecting a typical
relative efficiency observation in the middle of the time period, $Y_{24} = 1.012$, and equating
it to the response function with $X_{24,1} = 0$, $X_{24,2} = 30$, and the starting values for the other
regression coefficients (thus ignoring the error term):

$1.012 = 1.025 - (.5)\exp(30\gamma_2)$

Solving this equation for $\gamma_2$, the starting value $g_2^{(0)} = -.122$ was obtained. Tests for several
other representative observations yielded similar starting values, and $g_2^{(0)} = -.122$ was
therefore considered to be a reasonable initial value.
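
The starting value for $\gamma_2$ follows by solving the equation explicitly: $\exp(30\gamma_2) = (1.025 - 1.012)/.5 = .026$, so that $\gamma_2 = \ln(.026)/30$. A brief numerical check in Python, using only the figures quoted above (the variable names are ours):

```python
import math

y24, x24_2 = 1.012, 30            # selected observation: response and week
g0_start, g3_start = 1.025, -0.5  # starting values already chosen

# Solve y24 = g0_start + g3_start * exp(x24_2 * gamma2) for gamma2.
ratio = (y24 - g0_start) / g3_start      # equals exp(30 * gamma2)
g2_start = math.log(ratio) / x24_2
print(round(g2_start, 3))
```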

With the four starting values $g_0^{(0)} = 1.025$, $g_1^{(0)} = -.0459$, $g_2^{(0)} = -.122$, and $g_3^{(0)} = -.5$, a
computer package direct numerical search program was utilized to obtain the least squares
estimates. The least squares regression coefficients stabilized after five iterations. The final
estimates, together with the large-sample estimated standard deviations of their sampling
distributions, are presented in Table 13.5, columns 1 and 2. The fitted regression function is:

$\hat{Y} = 1.0156 - .04727X_1 - (.5524)\exp(-.1348X_2)$ (13.39)

The error sum of squares is SSE = .00329, with 30 - 4 = 26 degrees of freedom. Figure 13.5
presents the fitted regression functions for the two locations, together with a plot of the data. 
The fit seems to be quite good, and residual plots (not shown) did not indicate any noticeable 
departures from the assumed model. 

In order to explore the applicability of large-sample inference procedures here, bootstrap
fixed X sampling was employed. One thousand bootstrap samples of size 30 were generated.
The estimated bootstrap means and standard deviations for each of the sampling distributions
are presented in Table 13.5, columns 3 and 4. Note first that each least squares estimate
$g_k$ in column 1 of Table 13.5 is very close to the mean $\bar{g}_k^*$ of its respective bootstrap
sampling distribution in column 3, indicating that the estimates have very little bias. Note
also that each large-sample standard deviation $s\{g_k\}$ in column 2 of Table 13.5 is fairly
close to the respective bootstrap standard deviation $s^*\{g_k^*\}$ in column 4, again supporting the
applicability of large-sample inference procedures here. Finally, we present in Figure 13.6
MINITAB plots of the histograms of the four bootstrap sampling distributions. They appear
to be consistent with approximately normal sampling distributions. These results all indicate
that the sampling behavior of the nonlinear regression estimates is close to linear and
therefore support the use of large-sample inferences here.

TABLE 13.5 Nonlinear Least Squares Estimates and Standard Deviations and Bootstrap
Results: Learning Curve Example.

       Nonlinear Least Squares     Bootstrap
       (1)         (2)             (3)           (4)
k      g_k         s{g_k}          mean g_k*     s*{g_k*}
0      1.0156      .003672         1.015605      .003374
1      -.04727     .004109         -.04724       [illegible]
3      -.5524      .008157         -.55283       .007275
2      -.1348      .004359         -.13495       .004102

(Row labels k are matched to the coefficients of fitted function (13.39).)

There was special interest in the parameter $\gamma_1$, which reflects the effect of location. An
approximate 95 percent confidence interval is to be constructed. We require $t(.975; 26) = 2.056$.
The estimated standard deviation from Table 13.5 is $s\{g_1\} = .004109$. Hence, the
approximate 95 percent confidence limits for $\gamma_1$ are $-.04727 \pm 2.056(.004109)$, and the
confidence interval for $\gamma_1$ is:

$-.0557 \le \gamma_1 \le -.0388$

An approximate 95 percent confidence interval for $\gamma_1$ by the bootstrap reflection method
was also obtained for comparative purposes using (11.59). It is:

$-.0547 \le \gamma_1 \le -.0400$

This is very close to that obtained by large-sample inference procedures. Since $\gamma_1$ is seen to
be negative, these confidence intervals confirm that location A with its less modern facilities
tends to be less efficient.
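
As a check on these figures, the Python sketch below evaluates the fitted learning curve (13.39) and the large-sample interval for $\gamma_1$. It relies only on the estimates quoted in the text; the function and variable names are ours.

```python
import math

def y_hat(x1, x2):
    """Fitted learning curve (13.39); x1 = 1 for location A, 0 for location B."""
    return 1.0156 - 0.04727 * x1 - 0.5524 * math.exp(-0.1348 * x2)

# Large-sample 95 percent interval for gamma_1
g1, s_g1 = -0.04727, 0.004109     # Table 13.5, columns 1 and 2
t_crit = 2.056                    # t(.975; 26), taken from the t table
ci = (g1 - t_crit * s_g1, g1 + t_crit * s_g1)

# Estimated upper asymptotes implied by the fitted function
asymptote_b = 1.0156              # location B: g0
asymptote_a = 1.0156 - 0.04727    # location A: g0 + g1

print([round(v, 4) for v in ci])
print(y_hat(0, 90), y_hat(1, 90), asymptote_a, asymptote_b)
```

By week 90 the exponential term is negligible, so the fitted values for both locations are essentially at their estimated asymptotes.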

FIGURE 13.6 MINITAB Histograms of Bootstrap Sampling Distributions: Learning Curve Example.
(Panels (a) through (d): relative frequency histograms of the four bootstrap sampling distributions.)

Comments

1. When learning curve models are fitted to data constituting repeated observations on the same
unit, such as efficiency data for the same production unit at different points in time, the error terms may
be correlated. Hence, in these situations it is important to ascertain whether or not a model assuming
uncorrelated error terms is reasonable. In the learning curve example, a plot of the residuals against
time order did not suggest any serious correlations among the error terms.

2. With learning curve models, it is not uncommon to find that the error variances are unequal.
Again, therefore, it is important to check whether the assumption of constancy of error variance is
reasonable. In the learning curve example, plots of the residuals against the fitted values and time did
not suggest any serious heteroscedasticity problem. ■


13.6 Introduction to Neural Network Modeling 


In recent years there has been an explosion in the amount of available data, made possible 
in part by the widespread availability of low-cost computer memory and automated data 
collection systems. The regression modeling techniques discussed to this point in this book 
typically were developed for use with data sets involving fewer than 1,000 observations and 
fewer than 50 predictors. Yet it is not uncommon now to be faced with data sets involving 
perhaps millions of observations and hundreds or thousands of predictors. Examples include 
point-of-sale data in marketing, credit card scoring data, on-line monitoring of production 
processes, optical character recognition, internet e-mail filtering data, microchip array data, 
and computerized medical record data. This exponential growth in available data has
motivated researchers in the fields of statistics, artificial intelligence, and data mining to develop
simple, flexible, powerful procedures for data modeling that can be applied to very large 
data sets. In this section we discuss one such technique, neural network modeling. 


Neural Network Model 

The basic idea behind the neural network approach is to model the response as a nonlinear
function of various linear combinations of the predictors. Recall that our standard multiple
regression model (6.7) involves just one linear combination of the predictors, namely
$E\{Y_i\} = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}$. Thus, as we will demonstrate, the neural network
model is simply a nonlinear statistical model that contains many more parameters than the
corresponding linear statistical model. One result of this is that the models will typically
be overparameterized, resulting in parameters that are uninterpretable, which is a major
shortcoming of neural network modeling. An advantage of the neural network approach
is that the resulting model will often perform better in predicting future responses than a
standard regression model. Such models require large data sets, and are evaluated solely on
their ability to predict responses in hold-out (validation) data sets.

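In this section we describe the simplest, but most widely used, neural network model,
the single-hidden-layer, feedforward neural network. This network is sometimes referred
to as a single-layer perceptron. In a neural network model the $i$th response $Y_i$ is modeled
as a nonlinear function $g_Y$ of $m$ derived predictor values, $H_{i0}, H_{i1}, \ldots, H_{i,m-1}$:

$$Y_i = g_Y(\beta_0 H_{i0} + \beta_1 H_{i1} + \cdots + \beta_{m-1} H_{i,m-1}) + \varepsilon_i = g_Y(\mathbf{H}_i'\boldsymbol{\beta}) + \varepsilon_i \qquad (13.40)$$

where:

$$\underset{m \times 1}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{m-1} \end{bmatrix} \qquad \underset{m \times 1}{\mathbf{H}_i} = \begin{bmatrix} H_{i0} \\ H_{i1} \\ \vdots \\ H_{i,m-1} \end{bmatrix} \qquad (13.40a)$$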





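We take $H_{i0}$ equal to 1 and for $j = 1, \ldots, m - 1$, the $j$th derived predictor value for the $i$th
observation, $H_{ij}$, is a nonlinear function $g_j$ of a linear combination of the original predictors:

$$H_{ij} = g_j(\mathbf{X}_i'\boldsymbol{\alpha}_j) \qquad j = 1, \ldots, m - 1 \qquad (13.41)$$

where:

$$\underset{p \times 1}{\boldsymbol{\alpha}_j} = \begin{bmatrix} \alpha_{j0} \\ \alpha_{j1} \\ \vdots \\ \alpha_{j,p-1} \end{bmatrix} \qquad \underset{p \times 1}{\mathbf{X}_i} = \begin{bmatrix} X_{i0} \\ X_{i1} \\ \vdots \\ X_{i,p-1} \end{bmatrix} \qquad (13.41a)$$

and where $X_{i0} = 1$. Note that $\mathbf{X}_i'$ is the $i$th row of the $\mathbf{X}$ matrix. Equations (13.40) and
(13.41) together form the neural network model:

$$Y_i = g_Y(\mathbf{H}_i'\boldsymbol{\beta}) + \varepsilon_i = g_Y\left[\beta_0 + \sum_{j=1}^{m-1} \beta_j g_j(\mathbf{X}_i'\boldsymbol{\alpha}_j)\right] + \varepsilon_i \qquad (13.42)$$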


The $m$ functions $g_Y, g_1, \ldots, g_{m-1}$ are called activation functions in the neural networks
literature. To completely specify the neural network model, it is necessary to identify the $m$
activation functions. A common choice for each of these functions is the logistic function:

$$g(Z) = \frac{1}{1 + e^{-Z}} = [1 + e^{-Z}]^{-1} \qquad (13.43)$$

This function is flexible and can be adapted to a variety of circumstances.

As a simple example, consider the case of a single predictor, $X_1$. Then from (13.41), the
$j$th derived predictor for the $i$th observation is:

$$H_{ij} = [1 + \exp(-\alpha_{j0} - \alpha_{j1} X_{i1})]^{-1} \qquad (13.44)$$

(Note that (13.44) is a reparameterization of (13.11), with $\gamma_0 = 1$, $\gamma_1 = e^{-\alpha_{j0}}$, and $\gamma_2 =
-\alpha_{j1}$.) This function is shown in Figure 13.7 for various choices of $\alpha_{j0}$ and $\alpha_{j1}$. In
Figure 13.7a, the logistic function is plotted for fixed $\alpha_{j0} = 0$, and $\alpha_{j1} = .1$, 1, and 10. When
$\alpha_{j1} = .1$, the logistic function is approximately linear over a wide range; when $\alpha_{j1} = 10$,
the function is highly nonlinear in the center of the plot. Generally, relatively larger
parameters (in absolute value) are required for highly nonlinear responses, and relatively smaller
parameters result for approximately linear responses. Changing the sign of $\alpha_{j1}$ reverses the
orientation of the logistic function, as shown in Figure 13.7b. Finally, for a given value of
$\alpha_{j1}$, the position of the logistic function along the $X_1$-axis is controlled by $\alpha_{j0}$. In Figure
13.7c, the logistic function is plotted for fixed $\alpha_{j1} = 1$ and $\alpha_{j0} = -5$, 0, and 5. Note that
all of the plots in Figure 13.7 reflect a characteristic S- or sigmoidal-shape, and the fact that
the logistic function has a maximum of 1 and a minimum of 0.
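The behavior described for the single-predictor derived predictor (13.44) can be explored numerically. The sketch below is illustrative only; the function names `logistic` and `derived_predictor` are not from the text, and the parameter values simply echo those discussed for Figure 13.7.

```python
import math

def logistic(z):
    """Logistic activation g(Z) = 1 / (1 + exp(-Z)), as in (13.43)."""
    return 1.0 / (1.0 + math.exp(-z))

def derived_predictor(x, alpha0, alpha1):
    """Single-predictor derived predictor H = g(alpha0 + alpha1 * x), as in (13.44)."""
    return logistic(alpha0 + alpha1 * x)

# Steepness: a small slope gives a nearly linear curve, a large slope a sharp S-shape.
for alpha1 in (0.1, 1.0, 10.0):
    values = [round(derived_predictor(x, 0.0, alpha1), 3) for x in (-1.0, 0.0, 1.0)]
    print(f"alpha1 = {alpha1:>4}: H at x = -1, 0, 1 -> {values}")

# Orientation: changing the sign of alpha1 mirrors the curve.
flipped = derived_predictor(2.0, 0.0, -1.0)

# Position: with alpha1 = 1, the curve crosses 0.5 where x = -alpha0.
midpoint = derived_predictor(5.0, -5.0, 1.0)
```

Whatever the parameter values, the output stays strictly between 0 and 1, which is why the response is later rescaled to that interval.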

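Substitution of $g$ in (13.43) for each of $g_Y, g_1, \ldots, g_{m-1}$ in (13.42) yields the specific
neural network model to be discussed in this section:

$$Y_i = [1 + \exp(-\mathbf{H}_i'\boldsymbol{\beta})]^{-1} + \varepsilon_i$$

$$\phantom{Y_i} = \left[1 + \exp\left(-\beta_0 - \sum_{j=1}^{m-1} \beta_j [1 + \exp(-\mathbf{X}_i'\boldsymbol{\alpha}_j)]^{-1}\right)\right]^{-1} + \varepsilon_i$$

$$\phantom{Y_i} = f(\mathbf{X}_i, \boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{m-1}, \boldsymbol{\beta}) + \varepsilon_i \qquad (13.45)$$

where:

$\boldsymbol{\beta}, \boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{m-1}$ are unknown parameter vectors
$\mathbf{X}_i$ is a vector of known constants
$\varepsilon_i$ are residuals

FIGURE 13.7 Various Logistic Activation Functions for Single Predictor.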

Neural network model (13.45) is a special case of (13.12) and is therefore a nonlinear
regression model. In principle, all of the methods discussed in this chapter for estimation,
testing, and prediction with nonlinear models are applicable. Indeed, any nonlinear
regression package can be used to estimate the unknown coefficients. Recall, however, that these
models are generally overparameterized, and use of standard estimation methods will result
in fitted models that have poor predictive ability. This is analogous to leaving too many
unimportant predictors in a linear regression model. Special procedures for fitting model
(13.45) that lead to better prediction will be considered later in this section.
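To make the structure of model (13.45) concrete, the following sketch evaluates its mean response for one observation. The function names and the list-based layout of the parameters are assumptions made for illustration; they are not taken from the text or from any particular package.

```python
import math

def logistic(z):
    """Logistic activation (13.43)."""
    return 1.0 / (1.0 + math.exp(-z))

def nn_mean(x, alphas, beta):
    """Mean response of the single-hidden-layer network (13.45).

    x      : the p - 1 predictor values for one case (constant not included)
    alphas : m - 1 lists, each [alpha_j0, alpha_j1, ..., alpha_j,p-1]
    beta   : [beta_0, beta_1, ..., beta_{m-1}]
    """
    xi = [1.0] + list(x)                      # X_i0 = 1
    hidden = [1.0]                            # H_i0 = 1
    for alpha_j in alphas:                    # derived predictors (13.41)
        hidden.append(logistic(sum(a * v for a, v in zip(alpha_j, xi))))
    return logistic(sum(b * h for b, h in zip(beta, hidden)))

def n_parameters(m, p):
    """Number of parameters: m betas plus p alphas for each of m - 1 hidden nodes."""
    return m + p * (m - 1)

# Hypothetical values: one predictor, two derived predictors (m = 3, p = 2).
example = nn_mean([1.5], [[0.0, 1.0], [2.0, -1.0]], [0.1, 1.0, -1.0])
```

With m = 6 and p = 5, `n_parameters` gives 31, which shows how quickly the parameter count grows relative to a first-order linear model in the same predictors.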







Note that because the logistic activation function is bounded between 0 and 1, it is
necessary to scale $Y_i$ so that the scaled value, $Y_i^{sc}$, also falls within these limits. This can be
accomplished by using:

$$Y_i^{sc} = \frac{Y_i - Y_{\min}}{Y_{\max} - Y_{\min}}$$

where $Y_{\min}$ and $Y_{\max}$ are the minimum and maximum responses. It is also common practice
to center and scale each of the predictors to have mean 0 and standard deviation 1. These
transformations are generally handled automatically by neural network software.
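A minimal sketch of the two preprocessing steps is given below. The helper names are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption; software may use a different divisor.

```python
import statistics

def scale_response(y):
    """Rescale responses to [0, 1] using the minimum and maximum response."""
    y_min, y_max = min(y), max(y)
    return [(v - y_min) / (y_max - y_min) for v in y]

def standardize(x):
    """Center and scale a predictor to mean 0 and standard deviation 1."""
    mean = statistics.mean(x)
    sd = statistics.stdev(x)   # assumption: sample standard deviation (n - 1 divisor)
    return [(v - mean) / sd for v in x]

# Hypothetical values, for illustration only.
scaled_y = scale_response([2.0, 4.0, 6.0])
standardized_x = standardize([1.0, 2.0, 3.0])
```

Predictions from the fitted network are on the scaled response, so they must be mapped back with the same minimum and maximum before they are compared with the original responses.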


Network Representation 

Network diagrams are often used to depict a neural network model. Note that the standard
linear regression function:

$$E\{Y\} = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}$$

can be represented as a network as shown in Figure 13.8a. The link from each predictor $X_i$
to the response is labeled with the corresponding regression parameter, $\beta_i$.

The feedforward, single-hidden-layer neural network model (13.45) is shown in
Figure 13.8b. The predictor nodes are labeled $X_0, X_1, \ldots, X_{p-1}$ and are located on the left
side of the diagram. In the center of the diagram are $m$ hidden nodes. These nodes are
linked to the $p$ predictor nodes by relation (13.41); thus the links are labeled by using the
$\alpha$ parameters. Finally, the hidden nodes are linked to the response $Y$ by the $\beta$ parameters.


FIGURE 13.8 (a) Linear Regression Model (b) Neural Network Model

Comments

1. Neural networks were first used as models for the human brain. The nodes represented neurons
and the links between neurons represented synapses. A synapse would "fire" if the signal surpassed
a threshold. This suggested the use of step functions for the activation function, which were later
replaced by smooth functions such as the logistic function.

2. The logistic activation function is sometimes replaced by a radial basis function, which is an
n-dimensional normal probability density function. Details are provided in Reference 13.8. ■


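Neural Network as Generalization of Linear Regression

It is easy to see that the standard multiple regression model is a special case of neural
network model (13.45). If we choose for each of the activation functions $g_Y, g_1, \ldots, g_{m-1}$
the identity activation:

$$g(Z) = Z$$

we have:

$$E\{Y_i\} = \beta_0 + \beta_1 H_{i1} + \cdots + \beta_{m-1} H_{i,m-1} \qquad (13.46a)$$

and:

$$H_{ij} = \alpha_{j0} + \alpha_{j1} X_{i1} + \cdots + \alpha_{j,p-1} X_{i,p-1} \qquad (13.46b)$$

Substitution of (13.46b) into (13.46a) and rearranging yields:

$$E\{Y_i\} = \left[\beta_0 + \sum_{j=1}^{m-1} \beta_j \alpha_{j0}\right] + \left[\sum_{j=1}^{m-1} \beta_j \alpha_{j1}\right] X_{i1} + \cdots + \left[\sum_{j=1}^{m-1} \beta_j \alpha_{j,p-1}\right] X_{i,p-1}$$

$$\phantom{E\{Y_i\}} = \beta_0^* + \beta_1^* X_{i1} + \cdots + \beta_{p-1}^* X_{i,p-1} \qquad (13.47)$$

where:

$$\beta_0^* = \beta_0 + \sum_{j=1}^{m-1} \beta_j \alpha_{j0} \qquad \beta_k^* = \sum_{j=1}^{m-1} \beta_j \alpha_{jk} \quad \text{for } k = 1, \ldots, p - 1 \qquad (13.47a)$$

The neural network with identity activation functions thus reduces to the standard linear
regression model.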
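The reduction in (13.47) and (13.47a) can be verified numerically. In the sketch below all parameter values are made up for illustration and the function names are hypothetical. Two different sets of network parameters are shown to collapse to the same linear coefficients, which is the overparameterization at issue.

```python
def nn_identity(x, alphas, beta):
    """Network mean response with identity activations, (13.46a) and (13.46b)."""
    xi = [1.0] + list(x)
    hidden = [sum(a * v for a, v in zip(alpha_j, xi)) for alpha_j in alphas]
    return beta[0] + sum(b * h for b, h in zip(beta[1:], hidden))

def collapse_to_linear(alphas, beta):
    """Implied linear coefficients beta*_0, ..., beta*_{p-1} from (13.47a)."""
    p = len(alphas[0])
    star = [sum(b * alpha_j[k] for b, alpha_j in zip(beta[1:], alphas)) for k in range(p)]
    star[0] += beta[0]
    return star

# Two hypothetical parameter sets (m = 3, p = 2) implying the same linear model.
set_a = ([[1.0, 2.0], [3.0, 4.0]], [0.5, 1.0, -1.0])
set_b = ([[-1.5, -2.0], [0.0, 0.0]], [0.0, 1.0, 7.0])

star_a = collapse_to_linear(*set_a)
star_b = collapse_to_linear(*set_b)
prediction = nn_identity([3.0], *set_a)
```

Here both sets give an intercept of -1.5 and a slope of -2, so the data cannot distinguish between them.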

There is a problem, however, with the interpretation of the neural network regression
coefficients. If the regression function is given by $E\{Y_i\} = \beta_0^* + \beta_1^* X_{i1} + \cdots + \beta_{p-1}^* X_{i,p-1}$, as
indicated in (13.47), then any set of neural network parameters satisfying the $p$ equations in
(13.47a) gives the correct model. Since there are many more neural network parameters than
there are equations (or equivalently, $\beta^*$ parameters) there are infinitely many sets of neural
network parameters that lead to the correct model. Thus, any particular set of neural network
parameters will have no intrinsic meaning in this case.

This overparameterization problem is somewhat reduced with the use of the logistic 
activation function in place of the identity function. Generally, however, if the number of 
hidden nodes is more than just a few, overparameterization will be present, and will lead to 
a fitted model with low predictive ability unless this issue is explicitly considered when the 
parameters are estimated. We now take up such estimation procedures. 





Parameter Estimation: Penalized Least Squares 

In Chapter 9 we considered model selection and validation. There, we observed that while $R^2$
never decreases with the addition of a new predictor, our ability to predict holdout responses
in the validation stage can deteriorate if too many predictors are incorporated. Various
selection criteria, such as $R^2_{a,p}$, $SBC_p$, and $AIC_p$, have been adopted that contain penalties for
the addition of predictors. We commented in Section 11.2 that ridge regression estimates
can be obtained by the method of penalized least squares, which directly incorporates a
penalty for the sum of squares of the regression coefficients. In order to control the level of
overfitting, penalized least squares is frequently used for parameter estimation with neural
networks.

The penalized least squares criterion is given by:

$$Q = \sum_{i=1}^{n} [Y_i - f(\mathbf{X}_i, \boldsymbol{\beta}, \boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{m-1})]^2 + p_\lambda(\boldsymbol{\beta}, \boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{m-1}) \qquad (13.48)$$

where the overfit penalty is:

$$p_\lambda(\boldsymbol{\beta}, \boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{m-1}) = \lambda \left[\sum_{i=0}^{m-1} \beta_i^2 + \sum_{i=1}^{m-1} \sum_{j=0}^{p-1} \alpha_{ij}^2\right] \qquad (13.48a)$$

Thus, the penalty is a positive constant, $\lambda$, times the sum of squares of the nonlinear
regression coefficients. Note that the penalty is imposed not on the number of parameters
but on the total magnitude of the parameters. The penalty weight $\lambda$ assigned to the
regression coefficients governs the trade-off between overfitting and underfitting. If $\lambda$ is large,
the parameter estimates will be relatively small in absolute magnitude; if $\lambda$ is small, the
estimates will be relatively large. A "best" value for $\lambda$ is generally between .001 and .1 and
is chosen by cross-validation. For example, we may fit the model for a range of $\lambda$-values
between .001 and .1, and choose the value that minimizes the total prediction error of the
hold-out sample. The resulting parameter estimates are called shrinkage estimates because
use of $\lambda > 0$ leads to reductions in their absolute magnitudes.
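Criterion (13.48) is straightforward to evaluate once the mean response is available. The sketch below is a plain-Python illustration under assumed names and an assumed list-based parameter layout; it is not how any particular package implements the criterion.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def nn_mean(x, alphas, beta):
    """Mean response of model (13.45); alphas holds m - 1 lists, beta holds m values."""
    xi = [1.0] + list(x)
    hidden = [1.0] + [logistic(sum(a * v for a, v in zip(alpha_j, xi))) for alpha_j in alphas]
    return logistic(sum(b * h for b, h in zip(beta, hidden)))

def penalized_criterion(X, y, alphas, beta, lam):
    """Q = SSE + lambda * (sum of squared betas + sum of squared alphas), (13.48)."""
    sse = sum((yi - nn_mean(xi, alphas, beta)) ** 2 for xi, yi in zip(X, y))
    sum_sq = sum(b * b for b in beta) + sum(a * a for alpha_j in alphas for a in alpha_j)
    return sse + lam * sum_sq

# Hypothetical toy data (responses already scaled to [0, 1]) and parameter values.
X = [[0.0], [1.0]]
y = [0.5, 0.5]
alphas = [[0.5, -0.5]]
beta = [1.0, 2.0]
q_unpenalized = penalized_criterion(X, y, alphas, beta, 0.0)
q_penalized = penalized_criterion(X, y, alphas, beta, 0.05)
```

Raising `lam` leaves the error sum of squares for a given parameter set unchanged but makes large parameter values more costly, which is what pulls the minimizing estimates toward zero.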

In Section 13.3 we described various search procedures, such as the Gauss-Newton
method, for finding nonlinear least squares estimates. Such methods can also be used with
neural networks and penalized least squares criterion (13.48). We observed in Comment 1 on
page 524 that the choice of starting values is important. Poor choice of starting values may
lead to convergence to a local minimum (rather than the global minimum) when multiple
minima exist. The problem of multiple minima is especially prevalent when fitting neural
networks, due to the typically large numbers of parameters and the functional form of model
(13.48). For this reason, it is common practice to fit the model many times (typically between
10 and 50 times) using different sets of randomly chosen starting values for each fit. The set
of parameter estimates that leads to the lowest value of criterion function (13.48), i.e., the
best of the best, is chosen for further study. In the neural networks literature, finding a set
of parameter values that minimize criterion (13.48) is referred to as training the network.
The number of searches conducted before arriving at the final estimates is referred to as the
number of tours.
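The idea of repeated tours can be illustrated on a deliberately simple problem. In the sketch below, the one-parameter `criterion` is a hypothetical stand-in with two minima, not the neural network criterion itself, and plain steepest descent stands in for whatever search procedure the software uses.

```python
import random

def criterion(theta):
    """Hypothetical stand-in criterion with a local and a global minimum."""
    return theta ** 4 - 3.0 * theta ** 2 + theta

def gradient(theta):
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0

def descend(theta, step=0.01, iterations=2000):
    """Steepest descent from one starting value; converges to the nearest minimum."""
    for _ in range(iterations):
        theta -= step * gradient(theta)
    return theta

def train(n_tours=20, seed=0):
    """Run several tours from random starting values and keep the best fit."""
    rng = random.Random(seed)
    fits = [descend(rng.uniform(-2.0, 2.0)) for _ in range(n_tours)]
    return min(fits, key=criterion)

best = train()
```

A single tour started to the right of the local maximum ends at the inferior minimum near 1.13; keeping the best of several tours recovers the global minimum near -1.30.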







Comment 

Neural networks are often trained by a procedure called back-propagation. Back propagation is in 
fact the method of steepest descent, which can be very slow. Recommended methods include the 
conjugate gradient and variable metric methods. Reference 13.8 provides further details concerning 
back-propagation and other search procedures. ■ 

Example: Ischemic Heart Disease

We illustrate the use of neural network model (13.45) and the penalized least squares fitting
procedure using the Ischemic heart disease data set in Appendix C.9. These data were
collected by a health insurance plan and provide information concerning 788 subscribers
who made claims resulting from coronary heart disease. The response ($Y$) is the natural
logarithm of the total cost of services provided and the predictors to be studied here are:

Predictor   Description
X1:         Number of interventions, or procedures, carried out
X2:         Number of tracked drugs used
X3:         Number of comorbidities - other conditions present that complicate the treatment
X4:         Number of complications - other conditions that arose during treatment due to heart disease

The first 400 observations are used to fit model (13.45) and the last $n^* = 388$ observations
were held out for validation. (Note that the observations were originally sorted in a random
order, so that the hold-out data set is a random sample.) We used JMP to fit and evaluate
the neural network model.

Shown in Figure 13.9 is the JMP control panel, which allows the user to specify the
various characteristics of the model and the fitting procedure. Here, we have chosen 5 hidden
nodes, and we are using $\lambda = .05$ as the penalty weight. Also, we have chosen the default
values for the number of tours (20), the maximum number of iterations for the search procedure
(50) and the convergence criterion (.00001). By checking the "log the tours" box, we will
be keeping a record of the results of each of the 20 tours. A JMP network representation of
model (13.45) is shown in Figure 13.10. Note that this representation excludes the constant
nodes $X_0$ and $H_0$. In our notation, there are $m = 6$ hidden nodes and $p = 5$ predictor nodes,
and it is necessary to estimate $m + p(m - 1) = 6 + 5(6 - 1) = 31$ parameters.

FIGURE 13.9 JMP Control Panel for Neural Network Fit - Ischemic Heart Disease Example.

Control Panel
  Hidden Nodes          5
  Overfit Penalty       0.05
  Number of Tours       20
  Max Iterations        50
  Converge Criterion    0.00001
  Check boxes: Log the tours; Log the iterations; Log the estimates; Save iterations in table

FIGURE 13.10 JMP Neural Network Diagram - Ischemic Heart Disease Example.

FIGURE 13.11 JMP Results for Neural Network Fit - Ischemic Heart Disease Example.

Results
  Objective
    SSE        120.90315177
    Penalty    4.4087731663
    Total      125.31192493

  17 Converged At Best
   2 Converged Worse Than Best
   0 Stuck on Flat
   0 Failed to Improve
   1 Reached Max Iter

  Y         SSE           SSE Scaled      SSE Excluded    RMSE         RSquare   RSquare Excluded
  logCost   441.3037691   120.90315177    407.68215505    0.55465449   0.6962    0.7024

FIGURE 13.12 JMP Parameter Estimates for Neural Network Fit - Ischemic Heart Disease Example.

Parameter Estimates (rounded to four decimal places)
  Parameter             Estimate
  H1:Intercept           0.3216
  H2:Intercept           1.2553
  H3:Intercept           2.5830
  H4:Intercept          -1.5054
  H5:Intercept          -1.8321
  H1:Duration           -0.4104
  H1:Interventions       2.7694
  H1:Comorbids           1.3823
  H1:Complications       0.4149
  H2:Duration            0.1041
  H2:Interventions       0.9830
  H2:Comorbids           2.3590
  H2:Complications      -0.2013
  H3:Duration            1.5025
  H3:Interventions       1.0762
  H3:Comorbids          -0.4146
  H3:Complications       0.0544
  H4:Duration            1.2332
  H4:Interventions      -4.8879
  H4:Comorbids            .5766
  H4:Complications      -1.0680
  H5:Duration           -0.1598
  H5:Interventions       1.2562
  H5:Comorbids           0.1952
  H5:Complications       0.3718
  logCost:Intercept     -0.4433
  logCost:H1            -2.1659
  logCost:H2             1.4877
  logCost:H3             1.5397
  logCost:H4            -2.2854
  logCost:H5             1.6823

The results of the best fit, after 20 attempts or tours, are shown in Figure 13.11. The
penalized least squares criterion value is 125.31. SSE for the scaled response is 120.90.
JMP indicates that the corresponding SSE for the unscaled (original) responses is 441.30.
The total prediction error for the validation (excluded) data is given here by:

$$SSE_{VAL} = \sum_{i=401}^{788} (Y_i - \hat{Y}_i)^2 = 407.68$$

The mean squared prediction error (9.20) is obtained as $MSPR = SSE_{VAL}/n^* = 407.68/388 = 1.05$.
JMP also gives $R^2$ for the training data (.6962), and for the validation data
(.7024). This latter diagnostic was obtained using:

$$R^2_{VAL} = 1 - \frac{SSE_{VAL}}{SST_{VAL}}$$

where $SST_{VAL}$ is the total sum of squares for the validation data. Because these $R^2$ values
are approximately equal, we conclude that the use of weight penalty $\lambda = .05$ led to a good
balance between underfitting and overfitting.
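The two validation diagnostics can be computed directly from hold-out responses and their predictions. The helper below is a sketch with a hypothetical name, and the small data vectors are made up for illustration; only the final line uses figures quoted in the text.

```python
def validation_summary(y_obs, y_pred):
    """Return MSPR and validation R-square for hold-out responses and predictions."""
    n_star = len(y_obs)
    sse_val = sum((y - y_hat) ** 2 for y, y_hat in zip(y_obs, y_pred))
    y_bar = sum(y_obs) / n_star
    sst_val = sum((y - y_bar) ** 2 for y in y_obs)
    return {"MSPR": sse_val / n_star, "R2_VAL": 1.0 - sse_val / sst_val}

# Made-up hold-out values, for illustration only.
toy = validation_summary([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])

# MSPR from the figures quoted in the text: SSE_VAL = 407.68 and n* = 388.
mspr_text = 407.68 / 388
print(f"MSPR = {mspr_text:.2f}")
```

Comparing the validation R-square with the training R-square, as done above, is a quick check on whether the chosen penalty weight has kept overfitting under control.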

Figure 13.12 shows the 31 parameter estimates produced by JMP and the corresponding
parameters. We display these values only for completeness; we make no attempt at
interpretation. As noted earlier, our interest is centered on the prediction of future responses.

For comparison, two least squares regressions of $Y$ on the four predictors $X_1$, $X_2$, $X_3$,
and $X_4$ were also carried out. The first was based on a first-order model consisting of the
four predictors and an intercept term; the second was based on a full second-order model
consisting of an intercept plus the four linear terms, the four quadratic terms, and the six
cross-products among the four predictors. The results for these two multiple regression
models and the neural network model are summarized in Table 13.6.

From the results, we see that the neural network model's ability to predict holdout
responses is superior to the first-order multiple regression and slightly better than the
second-order multiple regression model. MSPR for the neural network is 1.05, whereas this statistic
for the first and second-order multiple regression models is 1.28 and 1.09, respectively.





FIGURE 13.13 Conditional Effects Plot - Ischemic Heart Disease Example. (Horizontal axis: Interventions, 0 to 50; curves for Duration = 0 and 160.)

Model Interpretation and Prediction 

While individual parameters and derived predictors are usually not interpretable, some
understanding of the effects of individual predictors can be realized through the use of
conditional effects plots. For example, Figure 13.13 shows, for the ischemic heart data
example, plots of predicted response as a function of the number of interventions ($X_2$) for
duration ($X_1$) equal to 0 and 160. The remaining predictors, comorbidities ($X_3 = 3.55$)
and complications ($X_4 = 0.05$), are fixed at their averages for values in the training set.
The plot indicates that the natural logarithm of cost increases rapidly as the number of
interventions increases from 0 to 25, and then reaches a plateau and is stable as the number
of interventions increases from 25 to 50. The duration variable seems to have very little
effect, except possibly when interventions are between 5 and 10.
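The points behind a conditional effects plot come from evaluating the fitted model over a grid of one predictor while the others are held fixed. In the sketch below, `conditional_effect` is a hypothetical helper and `toy_predict` is a made-up stand-in for a fitted model, used only so the example runs; it is not the fitted network from the example.

```python
def conditional_effect(predict, grid, index, fixed):
    """Vary predictor `index` over `grid`, holding the rest at the values in `fixed`."""
    curve = []
    for value in grid:
        x = list(fixed)          # copy so the fixed settings are not modified
        x[index] = value
        curve.append((value, predict(x)))
    return curve

def toy_predict(x):
    """Made-up prediction rule (NOT the fitted network): rises, then plateaus."""
    duration, interventions, comorbidities, complications = x
    return 5.0 + 0.1 * min(interventions, 25.0) + 0.0 * duration

# Settings echo the text: duration 0 or 160, other predictors at 3.55 and 0.05.
grid = [0.0, 10.0, 25.0, 50.0]
curve_dur_0 = conditional_effect(toy_predict, grid, 1, [0.0, 0.0, 3.55, 0.05])
curve_dur_160 = conditional_effect(toy_predict, grid, 1, [160.0, 0.0, 3.55, 0.05])
```

Plotting each curve against the grid, one line per duration setting, gives a display of the same kind as Figure 13.13.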

We have noted that neural network models can be very effective tools for prediction when 
large data sets are available. As always, it is important that the uncertainty in any prediction 
be quantified. Methods for producing approximate confidence intervals for estimation and 
prediction have been developed and some packages such as JMP now provide these intervals. 
Details are provided in Reference 13.9. 


TABLE 13.6 Comparisons of Results for Neural Network Model with Multiple Linear Regression Model - Ischemic Heart Disease Example.

                               Number of Parameters    MSE     MSPR
Neural Network                 31                      1.20    1.05
Multiple Linear Regression
  First-Order                  5                       1.74    1.28



Some Final Comments on Neural Network Modeling

In recent years, neural networks have found widespread application in many fields. Indeed, 
they have become one of the standard tools in the field of data mining, and their use continues 
to grow. This is due largely to the widespread availability of powerful computers that permit 
the fitting of complex models having dozens, hundreds, and even thousands, of parameters. 

A vocabulary has developed that is unique to the field of neural networks. The table below 
(adapted from Ref. 13.10) lists a number of terms that are commonly used by statisticians 
and their neural network equivalents: 


Statistical Term          Neural Network Term
coefficient               weight
predictor                 input
response                  output
observation               exemplar
parameter estimation      training or learning
steepest descent          back-propagation
intercept                 bias term
derived predictor         hidden node
penalty function          weight decay


There are a number of advantages to the neural network modeling approach. These 
include: 

1. Model (13.45) is extremely flexible, and can be used to represent a wide range of response 
surface shapes. For example, with sufficient data, curvatures, interactions, plateaus, and 
step functions can be effectively modeled. 

2. Standard regression assumptions, such as the requirements that the true residuals are 
mutually independent, normally distributed, and have constant variance, are not required 
for neural network modeling. 

3. Outliers in the response and predictors can still have a detrimental effect on the fit of the
model, but the use of the bounded logistic activation function tends to limit the influence
of individual cases in comparison with standard regression approaches.

Of course, there are disadvantages associated with the use of neural networks. Model 
parameters are generally uninterpretable, and the method depends on the availability of 
large data sets. Diagnostics, such as lack of fit tests, identification of influential observations 
and outliers, and significance testing for the effects of the various predictors, are currently 
not generally available. 


Cited References

13.1. Hartley, H. O. "The Modified Gauss-Newton Method for the Fitting of Non-linear Regression
Functions by Least Squares." Technometrics 3 (1961), pp. 269-80.

13.2. Gallant, A. R. Nonlinear Statistical Models. New York: John Wiley & Sons, 1987.

13.3. Kennedy, W. J., Jr., and J. E. Gentle. Statistical Computing. New York: Marcel Dekker, 1980.

13.4. Bates, D. M., and D. G. Watts. Nonlinear Regression Analysis and Its Applications. New York:
John Wiley & Sons, 1988.

13.5. Box, M. J. "Bias in Nonlinear Estimation." Journal of the Royal Statistical Society B 33 (1971),
pp. 171-201.

13.6. Hougaard, P. "The Appropriateness of the Asymptotic Distribution in a Nonlinear Regression
Model in Relation to Curvature." Journal of the Royal Statistical Society B 47 (1985),
pp. 103-14.

13.7. Ratkowsky, D. A. Nonlinear Regression Modeling. New York: Marcel Dekker, 1983.

13.8. Hastie, T., Tibshirani, R., and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer, 2001.

13.9. DeVeaux, R. D., Schumi, J., Schweinsberg, J., and L. H. Ungar. "Prediction Intervals for
Neural Networks via Nonlinear Regression." Technometrics 40 (1998), pp. 273-82.

13.10. DeVeaux, R. D., and L. H. Ungar. "A Brief Introduction to Neural Networks."
www.williams.edu/mathematics/rdeveaux/pubs.html (1996).


Problems

*13.1. For each of the following response functions, indicate whether it is a linear response function,
an intrinsically linear response function, or a nonlinear response function. In the case of an
intrinsically linear response function, state how it can be linearized by a suitable transformation:

a. $f(X, \gamma) = \exp(\gamma_0 + \gamma_1 X)$

b. $f(X, \gamma) = \gamma_0 + \gamma_1 (\gamma_2)^{X_1} - \gamma_3 X_2$

c. $f(X, \gamma) = \gamma_0 + \dfrac{\gamma_1}{\gamma_0} X$

13.2. For each of the following response functions, indicate whether it is a linear response function,
an intrinsically linear response function, or a nonlinear response function. In the case of an
intrinsically linear response function, state how it can be linearized by a suitable transformation:

a. $f(X, \gamma) = \exp(\gamma_0 + \gamma_1 \log_e X)$

b. $f(X, \gamma) = \gamma_0 (X_1)^{\gamma_1} (X_2)^{\gamma_2}$

c. $f(X, \gamma) = \gamma_0 - \gamma_1 (\gamma_2)^X$

*13.3. a. Plot the logistic response function:

$$f(X, \gamma) = \frac{300}{1 + (30)\exp(-1.5X)} \qquad X \ge 0$$

b. What is the asymptote of this response function? For what value of X does the response
function reach 90 percent of its asymptote?

13.4. a. Plot the exponential response function:

$$f(X, \gamma) = 49 - (30)\exp(-1.1X) \qquad X \ge 0$$

b. What is the asymptote of this response function? For what value of X does the response
function reach 95 percent of its asymptote?

*13.5. Home computers. A computer manufacturer hired a market research firm to investigate the
relationship between the likelihood a family will purchase a home computer and the price of
the home computer. The data that follow are based on replicate surveys done in two similar
cities. One thousand heads of households in each city were randomly selected and asked if
they would be likely to purchase a home computer at a given price. Eight prices (X, in dollars)
were studied, and 100 heads of households in each city were randomly assigned to a given
price. The proportion likely to purchase at a given price is denoted by Y.

City A
i:      1      2      3      4      5      6      7      8
X_i:    200    400    800    1200   1600   2000   3000   4000
Y_i:    .65    .46    .34    .26    .17    .15    .06    .04

City B
i:      9      10     11     12     13     14     15     16
X_i:    200    400    800    1200   1600   2000   3000   4000
Y_i:    .63    .50    .30    .24    .19    .12    .08    .05

No location effect is expected and the data are to be treated as independent replicates at each of
the 8 prices. The following exponential model with independent normal error terms is deemed
to be appropriate:

$$Y_i = \gamma_0 + \gamma_2 \exp(-\gamma_1 X_i) + \varepsilon_i$$

a. To obtain initial estimates of $\gamma_0$, $\gamma_1$, and $\gamma_2$, note that $f(X, \gamma)$ approaches a lower asymptote
$\gamma_0$ as X increases without bound. Hence, let $g_0^{(0)} = 0$ and observe that when we ignore the
error term, a logarithmic transformation then yields $Y_i' = \beta_0 + \beta_1 X_i$, where $Y_i' = \log_e Y_i$,
$\beta_0 = \log_e \gamma_2$, and $\beta_1 = -\gamma_1$. Therefore, fit a linear regression function based on the
transformed data and use as initial estimates $g_0^{(0)} = 0$, $g_1^{(0)} = -b_1$, and $g_2^{(0)} = \exp(b_0)$.

b. Using the starting values obtained in part (a), find the least squares estimates of the
parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$.
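The starting-value recipe in part (a) amounts to a simple linear regression of the logged responses on price. The sketch below carries it out with the data listed in the problem; the variable names are illustrative, and it covers only the starting values, not the nonlinear least squares fit asked for in part (b).

```python
import math

# Prices and proportions for City A followed by City B, as listed in the problem.
X = [200, 400, 800, 1200, 1600, 2000, 3000, 4000] * 2
Y = [.65, .46, .34, .26, .17, .15, .06, .04,
     .63, .50, .30, .24, .19, .12, .08, .05]

# With g0 = 0, log_e(Y) is approximately beta0 + beta1 * X.
log_y = [math.log(y) for y in Y]
n = len(X)
x_bar = sum(X) / n
ly_bar = sum(log_y) / n
b1 = (sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(X, log_y))
      / sum((x - x_bar) ** 2 for x in X))
b0 = ly_bar - b1 * x_bar

# Starting values: g0 = 0, g1 = -b1, g2 = exp(b0).
g0_start, g1_start, g2_start = 0.0, -b1, math.exp(b0)
print(g0_start, g1_start, g2_start)
```

These values would then be passed to an iterative procedure such as Gauss-Newton to obtain the least squares estimates.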

*13.6. Refer to Home computers Problem 13.5. 

a. Plot the estimated nonlinear regression function and the data. Does the fit appear to be 
adequate? 

b. Obtain the residuals and plot them against the fitted values and against X on separate 
graphs. Also obtain a normal probability plot. Does the model appear to be adequate? 

*13.7. Refer to Home computers Problem 13.5. Assume that large-sample inferences are appropriate
here. Conduct a formal approximate test for lack of fit of the nonlinear regression function;
use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

*13.8. Refer to Home computers Problem 13.5. Assume that the fitted model is appropriate and
that large-sample inferences can be employed. Obtain approximate joint confidence intervals
for the parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$, using the Bonferroni procedure and a 90 percent family
confidence coefficient.

*13.9. Refer to Home computers Problem 13.5. A question has been raised whether the two cities 
are similar enough so that the data can be considered to be replicates. Adding a location 
effect parameter analogous to (13.38) to the model proposed in Problem 13.5 yields the four- 
parameter nonlinear regression model: 

= Ko + 73^/2 + Yi exp(—yiX；|) +' £/ 

where: 

_ f 0 if city A 4 

2 ~\l if city B , » 

a. Using the same starting values as those obtained in Problem 13.5a and ^ 0) = 0, find the 

least squares estimates of the parameters y,, y 2 , and 

b. Assume that laige-sample inferences can be employed reasonably here. Obtain an approx¬ 
imate 95 percent confidence interval for y 3 . What does this interval indicate about city 



550 Part Three Nonlinear Regression 


a. To obtain starting values for y {) and y { , observe that when the eiTor term is ignored we have 

yi = A) + ^i x> n where K / = 1/K,-, = 1/Xo, — yi/yn, and X] = 1/X,-. Therefore 

a linear regression function to the transformed data to obtain initial estimates = j/f, 
and^l 0 ' = b x /b () . ° 

b. Using the starting values obtained in part (a), find the least squares estimates of the param- 
etei'S and yi . 

13.11. Refer to Enzyme kinetics Problem 13.10. 

a. Plot the estimated nonlinear regression function and the data. Does the fit appear to be 
adequate? 

b. Obtain the residuals and plot them against the fitted values and against X on separate 
graphs. Also obtain a normal probability plot. What do your plots show? 

c. Can you conduct an approximate formal lack of fit test here? Explain. 

d. Given that only 18 trials can be made, what are some advantages and disadvantages of con¬ 
sidering fewer concentration levels but with some replications, as compared to considering 
18 different concentration levels as was done here? 

13.12. Refer to Enzyme kinetics Problem 13.10. Assume that the fitted model is appropriate and 
that large-sample inferences can be employed here. (1) Obtain an approximate 95 percent 
confidence interval for (2) Test whether or not/i = 20; use a = .05. State the alternatives, 
decision rule, and conclusion. 

*13.13. Drug responsiveness. A pharmacologist modeled the responsiveness to a drug using the
following nonlinear regression model:

$Y_i = \gamma_0 - \dfrac{\gamma_0}{1 + (X_i/\gamma_2)^{\gamma_1}} + \varepsilon_i$

X denotes the dose level, in coded form, and Y the responsiveness expressed as a percent of
the maximum possible responsiveness. In the model, $\gamma_0$ is the expected response at saturation,
$\gamma_2$ is the concentration that produces a half-maximal response, and $\gamma_1$ is related to the slope.
The data for 19 cases at 13 dose levels follow:

i: 1 2 3 ... 17 18 19


differences? Is this result consistent with your conclusion in Problem 13J? Does
to be? Discuss.

13.10. Enzyme kinetics. In an enzyme kinetics study the velocity of a reaction (Y) is expected to be
related to the concentration (X) as follows:

$Y_i = \dfrac{\gamma_0 X_i}{\gamma_1 + X_i} + \varepsilon_i$

Eighteen concentrations have been studied and the results follow: 


i: 1 2 3 ... 16 17 18

[Data table: concentration $X_i$ and velocity $Y_i$ for the 18 trials; the values are not recoverable from the source.]


Chapter 13 Introduction to Nonlinear Regression and Neural Networks 551 


Obtain least squares estimates of the parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$, using starting values $g_0^{(0)} = 100$,
$g_1^{(0)} = 5$, and $g_2^{(0)} = 4.8$.

*13.14. Refer to Drug responsiveness Problem 13.13. 

a. Plot the estimated nonlinear regression function and the data. Does the fit appear to be 
adequate? 

b. Obtain the residuals and plot them against the fitted values and against X on separate graphs. 
Also obtain a normal probability plot. What do your plots show about the adequacy of the 
regression model? 

*13.15. Refer to Drug responsiveness Problem 13.13. Assume that large-sample inferences are
appropriate here. Conduct a formal approximate test for lack of fit of the nonlinear regression
function; use $\alpha = .01$. State the alternatives, decision rule, and conclusion.

*13.16. Refer to Drug responsiveness Problem 13.13. Assume that the fitted model is appropriate
and that large-sample inferences can be employed here. Obtain approximate joint confidence
intervals for the parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$ using the Bonferroni procedure with a 91 percent
family confidence coefficient. Interpret your results.

13.17. Process yield. The yield (Y) of a chemical process depends on the temperature ($X_1$) and
pressure ($X_2$). The following nonlinear regression model is expected to be applicable:

$Y_i = \gamma_0 (X_{i1})^{\gamma_1} (X_{i2})^{\gamma_2} + \varepsilon_i$

Prior to beginning full-scale production, 18 tests were undertaken to study the process yield 
for various temperature and pressure combinations. The results follow. 


i:        1     2     3    ...    16    17    18
X_i1:     1    10   100    ...     1    10   100
X_i2:     1     1     1    ...   100   100   100
Y_i:     12    32   103    ...    43   128   398


a. To obtain starting values for $\gamma_0$, $\gamma_1$, and $\gamma_2$, note that when we ignore the random error term,
a logarithmic transformation yields $Y_i' = \beta_0 + \beta_1 X_{i1}' + \beta_2 X_{i2}'$, where $Y_i' = \log_{10} Y_i$, $\beta_0 = \log_{10} \gamma_0$,
$\beta_1 = \gamma_1$, $X_{i1}' = \log_{10} X_{i1}$, $\beta_2 = \gamma_2$, and $X_{i2}' = \log_{10} X_{i2}$. Fit a first-order multiple
regression model to the transformed data, and use as starting values $g_0^{(0)} = \text{antilog}_{10}\, b_0$,
$g_1^{(0)} = b_1$, and $g_2^{(0)} = b_2$.

b. Using the starting values obtained in part (a), find the least squares estimates of the parameters
$\gamma_0$, $\gamma_1$, and $\gamma_2$.

13.18. Refer to Process yield Problem 13.17. 

a. Plot the estimated nonlinear regression function and the data. Does the fit appear to be 
adequate? 

b. Obtain the residuals and plot them against $\hat{Y}$, $X_1$, and $X_2$ on separate graphs. Also obtain
a normal probability plot. What do your plots show about the adequacy of the model?

13.19. Refer to Process yield Problem 13.17. Assume that large-sample inferences are appropriate
here. Conduct a formal approximate test for lack of fit of the nonlinear regression function;
use $\alpha = .05$. State the alternatives, decision rule, and conclusion.

13.20. Refer to Process yield Problem 13.17. Assume that the fitted model is appropriate and that 
large-sample inferences are applicable here. 

a. Test the hypotheses $H_0$: $\gamma_1 = \gamma_2$ against $H_a$: $\gamma_1 \ne \gamma_2$ using $\alpha = .05$. State the alternatives,
decision rule, and conclusion.



b. Obtain approximate joint confidence intervals for the parameters $\gamma_1$ and $\gamma_2$, using the
Bonferroni procedure and a 95 percent family confidence coefficient.

c. What do you conclude about the parameters $\gamma_1$ and $\gamma_2$ based on the results in parts (a)
and (b)?


Exercises

13.21. (Calculus needed.) Refer to Home computers Problem 13.5.

a. Obtain the least squares normal equations and show that they are nonlinear in the estimated
regression coefficients $g_0$, $g_1$, and $g_2$.

b. State the likelihood function for the nonlinear regression model, assuming that the error
terms are independent $N(0, \sigma^2)$.

13.22. (Calculus needed.) Refer to Enzyme kinetics Problem 13.10.

a. Obtain the least squares normal equations and show that they are nonlinear in the estimated
regression coefficients $g_0$ and $g_1$.

b. State the likelihood function for the nonlinear regression model, assuming that the error
terms are independent $N(0, \sigma^2)$.

13.23. (Calculus needed.) Refer to Process yield Problem 13.17. 


a. Obtain the least squares normal equations and show that they are nonlinear in the estimated
regression coefficients $g_0$, $g_1$, and $g_2$.

b. State the likelihood function for the nonlinear regression model, assuming that the error
terms are independent $N(0, \sigma^2)$.

13.24. Refer to Drug responsiveness Problem 13.13. 

a. Assuming that $E\{\varepsilon_i\} = 0$, show that:

$E\{Y\} = \gamma_0 \left(\dfrac{A}{1 + A}\right)$

where:

$A = \exp[\gamma_1(\log_e X - \log_e \gamma_2)] = \exp(\beta_0 + \beta_1 X')$

and $\beta_0 = -\gamma_1 \log_e \gamma_2$, $\beta_1 = \gamma_1$, and $X' = \log_e X$.

b. Assuming $\gamma_0$ is known, show that:

$\dfrac{E\{Y'\}}{1 - E\{Y'\}} = \exp(\beta_0 + \beta_1 X')$

where $Y' = Y/\gamma_0$.

c. What transformation do these results suggest for obtaining a simple linear regression
function in the transformed variables?

d. How can starting values for finding the least squares estimates of the nonlinear regression
parameters be obtained from the estimates of the linear regression coefficients?


13.25. Refer to Enzyme kinetics Problem 13.10. Starting values for finding the least squares estimates
of the nonlinear regression model parameters are to be obtained by a grid search. The following
bounds for the two parameters have been specified:

$5 \le \gamma_0 \le 65$
$5 \le \gamma_1 \le 65$





Obtain 49 grid points by using all possible combinations of the boundary values and five other 
equally spaced points for each parameter range. Evaluate the least squares criterion (13.15) 
for each grid point and identify the point providing the best fit. Does this point give reasonable 
starting values here? 
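
A grid search of this kind can be sketched as follows. The sketch assumes the enzyme kinetics response function $\gamma_0 X/(\gamma_1 + X)$ and uses hypothetical, noise-free data generated with $\gamma_0 = 25$ and $\gamma_1 = 15$ (not the problem's data); the function names `sse` and `grid_search` are invented for the illustration.

```python
from itertools import product


def sse(g0, g1, x, y):
    """Least squares criterion: sum of squared deviations of Y from g0*X/(g1 + X)."""
    return sum((yi - g0 * xi / (g1 + xi)) ** 2 for xi, yi in zip(x, y))


def grid_search(x, y, lo=5.0, hi=65.0, points=7):
    """Evaluate the criterion on a points-by-points grid and return the best grid point."""
    step = (hi - lo) / (points - 1)
    levels = [lo + k * step for k in range(points)]   # boundaries plus five interior points
    grid = list(product(levels, levels))              # all combinations of the levels
    best = min(grid, key=lambda g: sse(g[0], g[1], x, y))
    return grid, best


# Hypothetical noise-free data from Y = 25 X / (15 + X); not the problem's data.
x = [1.0, 2.0, 5.0, 10.0, 20.0, 40.0]
y = [25.0 * xi / (15.0 + xi) for xi in x]
grid, best = grid_search(x, y)
print(len(grid), best)
```

The best grid point is only as fine as the grid spacing, which is why it is used to start the iterative search rather than to replace it.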

13.26. Refer to Process yield Problem 13.17. Starting values for finding the least squares estimates of 
the nonlinear regression model parameters are to be obtained by a grid search. The following 
bounds for the parameters have been postulated: 

$1 \le \gamma_0 \le 21$
$.2 \le \gamma_1 \le .8$
$.1 \le \gamma_2 \le .7$

Obtain 27 grid points by using all possible combinations of the boundary values and the 
midpoint for each of the parameter ranges. Evaluate the least squares criterion (13.15) for 
each grid point and identify the point providing the best fit. Does this point give reasonable 
starting values here? 

Projects

13.27. Refer to Home computers Problem 13.5.

a. To check on the appropriateness of large-sample inferences here, generate 1,000 bootstrap
samples of size 16 using the fixed X sampling procedure. For each bootstrap sample, obtain
the least squares estimates $g_0^*$, $g_1^*$, and $g_2^*$.

b. Plot histograms of the bootstrap sampling distributions of $g_0^*$, $g_1^*$, and $g_2^*$. Do these
distributions appear to be approximately normal?

c. Compute the means and standard deviations of the bootstrap sampling distributions for $g_0^*$,
$g_1^*$, and $g_2^*$. Are the bootstrap means and standard deviations close to the final least squares
estimates?

d. Obtain a confidence interval for $\gamma_1$ using the reflection method in (11.59) and confidence
coefficient .9667. How does this interval compare with the one obtained in Problem 13.8
by the large-sample inference method?

e. What are the implications of your findings in parts (b), (c), and (d) about the appropriateness 
of large-sample inferences here? Discuss. 
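
The fixed X sampling procedure and the reflection method used in these projects can be sketched as below. Several things here are assumptions made for the illustration: a straight-line least squares fit stands in for the nonlinear fit (a nonlinear least squares routine would be passed as `fit` and `predict` instead), the data are simulated, the percentiles are taken from simple order statistics, and all function names are hypothetical. The interval follows the reflection form $b - d_2 \le \beta \le b + d_1$ with $d_1 = b - b^*(\alpha/2)$ and $d_2 = b^*(1 - \alpha/2) - b$.

```python
import random


def fit_line(x, y):
    """Stand-in fitting routine: ordinary least squares for a straight line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx
    return (ybar - b1 * xbar, b1)


def predict_line(coef, x):
    return [coef[0] + coef[1] * xi for xi in x]


def fixed_x_bootstrap(x, y, fit, predict, n_boot, rng):
    """Keep X fixed, resample residuals with replacement, and refit each time."""
    coef = fit(x, y)
    fitted = predict(coef, x)
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    boots = []
    for _ in range(n_boot):
        y_star = [fi + rng.choice(resid) for fi in fitted]
        boots.append(fit(x, y_star))
    return coef, boots


def reflection_interval(estimate, boot_values, conf):
    """Reflection method interval from bootstrap percentiles (order statistics)."""
    alpha = 1.0 - conf
    s = sorted(boot_values)
    lo_q = s[int((alpha / 2) * (len(s) - 1))]
    hi_q = s[int((1 - alpha / 2) * (len(s) - 1))]
    d1 = estimate - lo_q
    d2 = hi_q - estimate
    return estimate - d2, estimate + d1


# Simulated data (hypothetical): 16 cases from a straight line plus normal noise.
rng = random.Random(0)
x = [float(i) for i in range(1, 17)]
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
coef, boots = fixed_x_bootstrap(x, y, fit_line, predict_line, 1000, rng)
lower, upper = reflection_interval(coef[1], [b[1] for b in boots], 0.95)
print(lower, upper)
```

Passing the fitting routine as an argument keeps the resampling logic separate from the model, so the same sketch applies to any of the problems once a suitable nonlinear fit is supplied.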

13.28. Refer to Enzyme kinetics Problem 13.10.

a. To check on the appropriateness of large-sample inferences here, generate 1,000 bootstrap
samples of size 18 using the fixed X sampling procedure. For each bootstrap sample, obtain
the least squares estimates $g_0^*$ and $g_1^*$.

b. Plot histograms of the bootstrap sampling distributions of $g_0^*$ and $g_1^*$. Do these distributions
appear to be approximately normal?

c. Compute the means and standard deviations of the bootstrap sampling distributions for $g_0^*$
and $g_1^*$. Are the bootstrap means and standard deviations close to the final least squares
estimates?

d. Obtain a confidence interval for $\gamma_0$ using the reflection method in (11.59) and confidence
coefficient .95. How does this interval compare with the one obtained in Problem 13.12 by
the large-sample inference method?

e. What are the implications of your findings in parts (b), (c), and (d) about the appropriateness
of large-sample inferences here? Discuss.

13.29. Refer to Drug responsiveness Problem 13.13. 

a. To check on the appropriateness of large-sample inferences here, generate 1,000 bootstrap
samples of size 19 using the fixed X sampling procedure. For each bootstrap sample, obtain
the least squares estimates $g_0^*$, $g_1^*$, and $g_2^*$.



b. Plot histograms of the bootstrap sampling distributions of $g_0^*$, $g_1^*$, and $g_2^*$. Do these
distributions appear to be approximately normal?

c. Compute the means and standard deviations of the bootstrap sampling distributions for $g_0^*$,
$g_1^*$, and $g_2^*$. Are the bootstrap means and standard deviations close to the final least squares
estimates?

d. Obtain a confidence interval for $\gamma_1$ using the reflection method in (11.59) and confidence
coefficient .97. How does this interval compare with the one obtained in Problem 13 by
the large-sample inference method?

e. What are the implications of your findings in parts (b), (c), and (d) about the appropriateness 
of large-sample inferences here? Discuss. 

13.30. Refer to Process yield Problem 13.17. 

a. To check on the appropriateness of large-sample inferences here, generate 1,000 bootstrap
samples of size 18 using the fixed X sampling procedure. For each bootstrap sample, obtain
the least squares estimates $g_0^*$, $g_1^*$, and $g_2^*$.

b. Plot histograms of the bootstrap sampling distributions of $g_0^*$, $g_1^*$, and $g_2^*$. Do these
distributions appear to be approximately normal?

c. Compute the means and standard deviations of the bootstrap sampling distributions for $g_0^*$,
$g_1^*$, and $g_2^*$. Are the bootstrap means and standard deviations close to the final least squares
estimates?

d. Obtain a confidence interval for $\gamma_1$ using the reflection method in (11.59) and confidence
coefficient .975. How does this interval compare with the one obtained in Problem 13.20b
by the large-sample inference method?

e. What are the implications of your findings in parts (b), (c), and (d) about the appropriateness
of large-sample inferences here? Discuss. 


Case Studies

13.31. Refer to the Prostate cancer data set in Appendix C.5 and Case Study 9.30. Select a random
sample of 65 observations to use as the model-building data set.

a. Develop a neural network model for predicting PSA. Justify your choice of number of 
hidden nodes and penalty function weight and interpret your model. 

b. Assess your model’s ability to predict and discuss its usefulness to the oncologists. 

c. Compare the performance of your neural network model with that of the best regression 
model obtained in Case Study 9.30. Which model is more easily interpreted and why? 


13.32. Refer to the Real estate sales data set in Appendix C.7 and Case Study 9.31. Select a random
sample of 300 observations to use as the model-building data set.

a. Develop a neural network model for predicting sales price. Justify your choice of number
of hidden nodes and penalty function weight and interpret your model.

b. Assess your model's ability to predict and discuss its usefulness as a tool for predicting
sales prices.

c. Compare the performance of your neural network model with that of the best regression 
model obtained in Case Study 9.31. Which model is more easily interpreted and why? 



Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models

In Chapter 13 we considered nonlinear regression models where the error terms are normally 
distributed. In this chapter, we take up nonlinear regression models for two important cases 
where the response outcomes are discrete and the error terms are not normally distributed. 
First, we consider the logistic nonlinear regression model for use when the response variable 
is qualitative with two possible outcomes, such as financial status of firm (sound status, 
headed toward insolvency) or blood pressure status (high blood pressure, not high blood 
pressure). We then extend this model so that it can be applied when the response variable is 
a qualitative variable having more than two possible outcomes; for instance, blood pressure 
status might be classified as high, normal, or low. 

Next we take up the Poisson regression model for use when the response variable is 
a count where large counts are rare events, such as the number of tornadoes in an upper 
Midwest locality during a year. Finally, we explain that nearly all of the nonlinear regression 
models discussed in Chapter 13 and in this chapter, as well as the normal error linear models 
discussed earlier, belong to a family of regression models called generalized linear models. 

The nonlinear regression models presented in this chapter are appropriate for analyzing 
data arising from either observational studies or from experimental studies. 

14.1 Regression Models with Binary Response Variable

In a variety of regression applications, the response variable of interest has only two possible
qualitative outcomes, and therefore can be represented by a binary indicator variable taking
on values 0 and 1.

1. In an analysis of whether or not business firms have an industrial relations department,
according to size of firm, the response variable was defined to have the two possible
outcomes: firm has industrial relations department, firm does not have industrial relations
department. These outcomes may be coded 1 and 0, respectively (or vice versa).


2. In a study of labor force participation of married women, as a function of age, number
of children, and husband's income, the response variable Y was defined to have the two
possible outcomes: married woman in labor force, married woman not in labor force. Again,
these outcomes may be coded 1 and 0, respectively.

3. In a study of liability insurance possession, according to age of head of household,
amount of liquid assets, and type of occupation of head of household, the response variable Y
was defined to have the two possible outcomes: household has liability insurance, household
does not have liability insurance. These outcomes again may be coded 1 and 0, respectively.

4. In a longitudinal study of coronary heart disease as a function of age, gender, smoking
history, cholesterol level, percent of ideal body weight, and blood pressure, the response
variable Y was defined to have the two possible outcomes: person developed heart disease
during the study, person did not develop heart disease during the study. These outcomes
again may be coded 1 and 0, respectively.

These examples show the wide range of applications in which the response variable is
binary and hence may be represented by an indicator variable. A binary response variable,
taking on the values 0 and 1, is said to involve binary responses or dichotomous responses.
We consider first the meaning of the response function when the outcome variable is binary,
and then we take up some special problems that arise with this type of response variable.


Meaning of Response Function when Outcome Variable Is Binary 

Consider the simple linear regression model:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad Y_i = 0, 1$    (14.1)

where the outcome $Y_i$ is binary, taking on the value of either 0 or 1. The expected response
$E\{Y_i\}$ has a special meaning in this case. Since $E\{\varepsilon_i\} = 0$ we have:

$E\{Y_i\} = \beta_0 + \beta_1 X_i$    (14.2)

Consider $Y_i$ to be a Bernoulli random variable for which we can state the probability
distribution as follows:

$Y_i$    Probability
1        $P(Y_i = 1) = \pi_i$
0        $P(Y_i = 0) = 1 - \pi_i$

Thus, $\pi_i$ is the probability that $Y_i = 1$, and $1 - \pi_i$ is the probability that $Y_i = 0$. By the
definition of expected value of a random variable in (A.12), we obtain:

$E\{Y_i\} = 1(\pi_i) + 0(1 - \pi_i) = \pi_i = P(Y_i = 1)$    (14.3)

Equating (14.2) and (14.3), we thus find:

$E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i$    (14.4)


FIGURE 14.1 Illustration of Response Function when Response Variable Is Binary: Industrial Relations Department Example. (Vertical axis: Probability That Firm Has Industrial Relations Department.)

The mean response $E\{Y_i\} = \beta_0 + \beta_1 X_i$ as given by the response function is therefore simply
the probability that $Y_i = 1$ when the level of the predictor variable is $X_i$. This interpretation
of the mean response applies whether the response function is a simple linear one, as here,
or a complex multiple regression one. The mean response, when the outcome variable is
a 0, 1 indicator variable, always represents the probability that $Y = 1$ for the given levels
of the predictor variables. Figure 14.1 illustrates a simple linear response function for an
indicator outcome variable. Here, the indicator variable Y refers to whether or not a firm has
an industrial relations department, and the predictor variable X is size of firm. The response
function in Figure 14.1 shows the probability that firms of given size have an industrial
relations department.

Special Problems when Response Variable Is Binary 

Special problems arise, unfortunately, when the response variable is an indicator variable. 
We consider three of these now, using a simple linear regression model as an illustration. 

1. Nonnormal Error Terms. For a binary 0, 1 response variable, each error term $\varepsilon_i =
Y_i - (\beta_0 + \beta_1 X_i)$ can take on only two values:

When $Y_i = 1$: $\varepsilon_i = 1 - \beta_0 - \beta_1 X_i$    (14.5a)

When $Y_i = 0$: $\varepsilon_i = -\beta_0 - \beta_1 X_i$    (14.5b)

Clearly, normal error regression model (2.1), which assumes that the $\varepsilon_i$ are normally
distributed, is not appropriate.

2. Nonconstant Error Variance. Another problem with the error terms is that they do
not have equal variances when the response variable is an indicator variable. To see this,
we shall obtain $\sigma^2\{Y_i\}$ for the simple linear regression model (14.1), utilizing (A.15):

$\sigma^2\{Y_i\} = E\{(Y_i - E\{Y_i\})^2\} = (1 - \pi_i)^2 \pi_i + (0 - \pi_i)^2 (1 - \pi_i)$

or:

$\sigma^2\{Y_i\} = \pi_i(1 - \pi_i) = (E\{Y_i\})(1 - E\{Y_i\})$    (14.6)

The variance of $\varepsilon_i$ is the same as that of $Y_i$ because $\varepsilon_i = Y_i - \pi_i$ and $\pi_i$ is a constant:

$\sigma^2\{\varepsilon_i\} = \pi_i(1 - \pi_i) = (E\{Y_i\})(1 - E\{Y_i\})$    (14.7)

or:

$\sigma^2\{\varepsilon_i\} = (\beta_0 + \beta_1 X_i)(1 - \beta_0 - \beta_1 X_i)$    (14.7a)

Note from (14.7a) that $\sigma^2\{\varepsilon_i\}$ depends on $X_i$. Hence, the error variances will differ at
different levels of X, and ordinary least squares will no longer be optimal.

3. Constraints on Response Function. Since the response function represents probabilities
when the outcome variable is a 0, 1 indicator variable, the mean responses should be
constrained as follows:

$0 \le E\{Y\} = \pi \le 1$    (14.8)

Many response functions do not automatically possess this constraint. A linear response
function, for instance, may fall outside the constraint limits within the range of the predictor
variable in the scope of the model.
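
The three problems above can be seen in a small numerical sketch. The coefficients $\beta_0 = .1$ and $\beta_1 = .02$ and the X levels are hypothetical values chosen only for illustration, and the function names are invented here.

```python
def linear_mean(b0, b1, x):
    """Linear response function E{Y} = b0 + b1*x."""
    return b0 + b1 * x


def error_values(b0, b1, x):
    """The only two values the error term can take at X = x, per (14.5a) and (14.5b)."""
    m = linear_mean(b0, b1, x)
    return (1 - m, -m)


def error_variance(b0, b1, x):
    """Error variance (b0 + b1*x)(1 - b0 - b1*x), per (14.7a)."""
    m = linear_mean(b0, b1, x)
    return m * (1 - m)


b0, b1 = 0.1, 0.02          # hypothetical coefficients
for x in (5, 20, 40, 50):   # hypothetical predictor levels
    m = linear_mean(b0, b1, x)
    inside = 0 <= m <= 1    # constraint (14.8)
    print(x, round(m, 2), error_values(b0, b1, x), round(error_variance(b0, b1, x), 4), inside)
```

In this sketch the variance changes with X, and at the largest X level the linear mean response exceeds 1, at which point formula (14.7a) even returns a negative value, so the linear function cannot be interpreted as a probability there.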


FIGURE 14.2 Examples of Probit and Logistic Mean Response Functions.

(a) Probit, with $\beta_0^* = 0$    (b) Probit, with $\beta_1^* = -1$    (horizontal axis in each panel: X)



The difficulties created by the need for the restriction in (14.8) on the response function 
are the most serious. One could use weighted least squares to handle the problem of unequal 
error variances. In addition, with large sample sizes the method of least squares provides
estimators that are asymptotically normal under quite general conditions, even if the distri¬ 
bution of the error terms is far from normal. However, the constraint on the mean responses 
to fall between 0 and 1 frequently will rule out a linear response function. In the industrial 
relations department example, for instance, use of a linear response function subject to the 
constraints on the mean response might require a probability of 0 for the mean response for 
all small firms and a probability of 1 for the mean response for all large firms, as illustrated 
in Figure 14.1. Such a model would often be considered unreasonable. Instead, a model 
where the probabilities 0 and 1 are reached asymptotically, as illustrated by each of the 
S-shaped curves in Figure 14.2, would usually be more appropriate.


14.2 Sigmoidal Response Functions for Binary Responses

In this section, we introduce three response functions for modeling binary responses. These
functions are bounded between 0 and 1, have a characteristic sigmoidal or S-shape, and
approach 0 and 1 asymptotically. These functions arise naturally when the binary response
variable results from a zero-one recoding (or dichotomization) of an underlying continuous
response variable, and they are often appropriate for discrete binary responses as well.


Probit Mean Response Function 

Consider a health researcher studying the effect of a mother's use of alcohol (X, an index
of degree of alcohol use during pregnancy) on the duration of her pregnancy ($Y^c$). Here
we use the superscript c to emphasize that the response variable, pregnancy duration, is a
continuous response. This can be represented by a simple linear regression model:

$Y_i^c = \beta_0^c + \beta_1^c X_i + \varepsilon_i^c$    (14.9)

and we will assume that $\varepsilon_i^c$ is normally distributed with mean zero and variance $\sigma_c^2$.

If the continuous response variable, pregnancy duration, were available, we might proceed
with the usual simple linear regression analysis. However, in this instance, researchers
coded each pregnancy duration as preterm or full term using the following rule:

$Y_i = 1$ if $Y_i^c \le 38$ weeks (preterm)
$Y_i = 0$ if $Y_i^c > 38$ weeks (full term)


It follows from (14.3) and (14.9) that:

$P(Y_i = 1) = \pi_i$    (14.10a)
$= P(Y_i^c \le 38)$    (14.10b)
$= P(\beta_0^c + \beta_1^c X_i + \varepsilon_i^c \le 38)$    (14.10c)
$= P\left(\dfrac{\varepsilon_i^c}{\sigma_c} \le \dfrac{38 - \beta_0^c}{\sigma_c} - \dfrac{\beta_1^c}{\sigma_c} X_i\right)$    (14.10d)
$= P(Z \le \beta_0^* + \beta_1^* X_i)$    (14.10e)

where $\beta_0^* = (38 - \beta_0^c)/\sigma_c$, $\beta_1^* = -\beta_1^c/\sigma_c$, and $Z = \varepsilon_i^c/\sigma_c$ follows a standard normal
distribution. If we let $P(Z \le z) = \Phi(z)$, we have, from (14.10a-e):

$P(Y_i = 1) = \Phi(\beta_0^* + \beta_1^* X_i)$    (14.11)

Equations (14.3) and (14.11) together yield the nonlinear regression function known as
the probit mean response function:

$E\{Y_i\} = \pi_i = \Phi(\beta_0^* + \beta_1^* X_i)$    (14.12)

The inverse function, $\Phi^{-1}$, of the standard normal cumulative distribution function $\Phi$
is sometimes called the probit transformation. We solve for the linear predictor, $\beta_0^* + \beta_1^* X_i$,
in (14.12) by applying the probit transformation to both sides of the expression, obtaining:

$\Phi^{-1}(\pi_i) = \pi_i' = \beta_0^* + \beta_1^* X_i$    (14.13)

The resulting expression, $\pi_i' = \beta_0^* + \beta_1^* X_i$, is called the probit response function, or more
generally, the linear predictor.

Plots of the probit mean response function (14.12) for various values of $\beta_0^*$ and $\beta_1^*$ are
shown in Figures 14.2a and 14.2b. Some characteristics of this response function are:

1. The probit mean response function is bounded between 0 and 1, and it approaches these
limits asymptotically.

2. As $\beta_1^*$ increases (for $\beta_1^* > 0$), the mean function becomes more S-shaped, changing
more rapidly in the center. Figure 14.2a shows two probit mean response functions,
where both intercept coefficients are 0, and the slope coefficients are 1 and 5. Notice that
the curve has a more pronounced S-shape with $\beta_1^* = 5$.

3. Changing the sign of $\beta_1^*$ from positive to negative changes the mean response function
from a monotone increasing function to a monotone decreasing function. The probit
mean response functions plotted in Figure 14.2a have positive slope coefficients while
those in Figure 14.2b have negative slope coefficients.

4. Increasing or decreasing the intercept $\beta_0^*$ shifts the mean response function horizontally.
(The direction of the shift depends on the signs of both $\beta_0^*$ and $\beta_1^*$.) Figure 14.2b shows
two probit mean response functions, where both slope coefficients are $-1$, and the
intercept coefficients are 0 and 5. Notice that the curve has shifted to the right as $\beta_0^*$
changes from 0 to 5.

5. Finally, we note the following symmetry property of the probit response function. If the
response variable is recoded using $Y_i' = 1 - Y_i$, that is, by changing the 1s to 0s and
the 0s to 1s, the signs of all of the coefficients are reversed. This follows easily from
the symmetry of the standard normal distribution: since $\Phi(Z) = 1 - \Phi(-Z)$, it follows
that $P(Y_i' = 1) = P(Y_i = 0) = 1 - \Phi(\beta_0^* + \beta_1^* X_i) = \Phi(-\beta_0^* - \beta_1^* X_i)$.
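
The properties listed above can be checked numerically with a short sketch. It evaluates the standard normal cumulative distribution function through the error function in Python's `math` module; the function names and the particular X values are assumptions made for the illustration, while the coefficient pairs mirror those described for Figures 14.2a and 14.2b.

```python
import math


def std_normal_cdf(z):
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_mean(b0, b1, x):
    """Probit mean response function (14.12)."""
    return std_normal_cdf(b0 + b1 * x)


# Property 2: a larger positive slope gives a steeper curve around the center.
print(probit_mean(0, 1, 0.5), probit_mean(0, 5, 0.5))

# Property 4: with slope -1, raising the intercept from 0 to 5 moves the
# point where the curve equals .5 from X = 0 to X = 5.
print(probit_mean(0, -1, 0.0), probit_mean(5, -1, 5.0))

# Property 5: recoding Y as 1 - Y reverses the signs of both coefficients.
print(1 - probit_mean(0.4, -1.0, 2.0), probit_mean(-0.4, 1.0, 2.0))
```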


Logistic Mean Response Function 

We have seen that the assumption of normally distributed errors for the underlying continuous
response variable in (14.9) led to the use of the standard normal cumulative distribution
function, $\Phi$, to model $\pi_i$. An alternative error distribution that is very similar to the normal
distribution is the logistic distribution. Figure 14.3 presents plots of the standard normal
density function and the logistic density function, each with mean zero and variance one.
The plots are nearly indistinguishable, although the logistic distribution has slightly heavier
tails.

FIGURE 14.3 Plots of Normal Density (dashed line) and Logistic Density (solid line), Each Having Mean 0 and Variance 1.

The density of a logistic random variable $\varepsilon_L$ having mean zero and standard deviation
$\sigma = \pi/\sqrt{3}$ has a simple form:

$f_L(\varepsilon_L) = \dfrac{\exp(\varepsilon_L)}{[1 + \exp(\varepsilon_L)]^2}$    (14.14a)

Its cumulative distribution function is:

$F_L(\varepsilon_L) = \dfrac{\exp(\varepsilon_L)}{1 + \exp(\varepsilon_L)}$    (14.14b)

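Suppose now that $\varepsilon_i^c$ in (14.9) has a logistic distribution with mean zero and standard
deviation $\sigma_c$. Then, from (14.10d) we have:

$P(Y_i = 1) = P\left(\dfrac{\varepsilon_i^c}{\sigma_c} \le \beta_0^* + \beta_1^* X_i\right)$

where $\varepsilon_i^c/\sigma_c$ follows a logistic distribution with mean zero and standard deviation one.
Multiplying both sides of the inequality inside the probability statement on the right by
$\pi/\sqrt{3}$ does not change the probability; therefore:

$P(Y_i = 1) = \pi_i = P\left(\dfrac{\pi}{\sqrt{3}} \dfrac{\varepsilon_i^c}{\sigma_c} \le \dfrac{\pi}{\sqrt{3}} \beta_0^* + \dfrac{\pi}{\sqrt{3}} \beta_1^* X_i\right)$    (14.15a)
$= P(\varepsilon_L \le \beta_0 + \beta_1 X_i)$    (14.15b)
$= F_L(\beta_0 + \beta_1 X_i)$    (14.15c)
$= \dfrac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$    (14.15d)

where $\beta_0 = (\pi/\sqrt{3})\beta_0^*$ and $\beta_1 = (\pi/\sqrt{3})\beta_1^*$ denote the logistic regression parameters. To
summarize, the logistic mean response function is:

$E\{Y_i\} = \pi_i = F_L(\beta_0 + \beta_1 X_i) = \dfrac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$    (14.16)

Straightforward algebra shows that an equivalent form of (14.16) is given by:

$E\{Y_i\} = \pi_i = [1 + \exp(-\beta_0 - \beta_1 X_i)]^{-1}$    (14.17)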




FIGURE 14.4 Plots of Gumbel (dashed line), Normal (black line), and Logistic (gray line) Density Functions, Each Having Mean 0 and Variance 1.

Applying the inverse of the cumulative distribution function $F_L$ to the two middle terms in
(14.16) yields:

$F_L^{-1}(\pi_i) = \pi_i' = \beta_0 + \beta_1 X_i$    (14.18)

The transformation $F_L^{-1}(\pi_i)$ is called the logit transformation of the probability $\pi_i$, and is
given by:

$F_L^{-1}(\pi_i) = \log_e\left(\dfrac{\pi_i}{1 - \pi_i}\right)$    (14.18a)

where the ratio $\pi_i/(1 - \pi_i)$ in (14.18a) is called the odds. The linear predictor in (14.18) is
referred to as the logit response function.

Figures 14.2c and 14.2d each show two logistic mean response functions, where the 
parameters correspond to those in Figures 14.2a and 14.2b for the probit mean response 
function. It is clear from the plots that these logistic mean response functions are qualitatively 
similar to the corresponding probit mean response functions. The five properties of the probit 
mean response function, listed earlier, are also true for the logistic mean response function.
The observed differences in logistic and probit mean response functions are largely due 
to the differences in the scaling of the parameters mentioned previously. Note that the 
symmetry property for the probit mean response function also holds for the logistic mean 
response function. 
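
A brief sketch can confirm that the two forms (14.16) and (14.17) agree, that the logit transformation (14.18a) recovers the linear predictor, and that the symmetry property holds. The coefficient values and function names are hypothetical choices for the illustration.

```python
import math


def logistic_mean(b0, b1, x):
    """Logistic mean response function in the form (14.16)."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))


def logistic_mean_alt(b0, b1, x):
    """Equivalent form (14.17)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def logit(p):
    """Logit transformation (14.18a): the log of the odds."""
    return math.log(p / (1.0 - p))


b0, b1, x = 0.5, -1.2, 2.0      # hypothetical values
p = logistic_mean(b0, b1, x)
print(p, logistic_mean_alt(b0, b1, x))      # the two forms agree
print(logit(p), b0 + b1 * x)                # logit recovers the linear predictor
print(1 - p, logistic_mean(-b0, -b1, x))    # symmetry under recoding Y as 1 - Y
```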

Complementary Log-Log Response Function 

A third mean response function is sometimes used when the error distribution of $\varepsilon^c$ is
not symmetric. The density function $f_G(\varepsilon)$ of the extreme value or Gumbel probability
distribution having mean zero and variance one is shown in Figure 14.4, along with the
comparable standard normal and logistic densities discussed earlier. Notice that this density
is skewed to the right and clearly distinct from the standard normal and logistic densities.
It can be shown that use of the Gumbel error distribution for $\varepsilon^c$ in (14.9) leads to the mean
response function:

$\pi_i = 1 - \exp\left(-\exp(\beta_0^G + \beta_1^G X_i)\right)$    (14.19)

Solving for the linear predictor $\beta_0^G + \beta_1^G X_i$, we obtain the complementary log-log response
model:

$\pi_i' = \log[-\log(1 - \pi(X_i))] = \beta_0^G + \beta_1^G X_i$    (14.19a)

The symmetry property discussed on page 560 for the logit and probit models does not hold
for (14.19).
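
The lack of symmetry can be verified directly. The sketch below assumes hypothetical coefficient values and invented function names; it checks that the complementary log-log transformation inverts (14.19) and that reversing the signs of the coefficients does not reproduce the recoded probability.

```python
import math


def cloglog_mean(b0, b1, x):
    """Complementary log-log mean response function (14.19)."""
    return 1.0 - math.exp(-math.exp(b0 + b1 * x))


def cloglog_link(p):
    """Complementary log-log transformation (14.19a)."""
    return math.log(-math.log(1.0 - p))


# The transformation recovers the linear predictor 0.2 + 0.7(1.5) = 1.25.
print(cloglog_link(cloglog_mean(0.2, 0.7, 1.5)))

# Recoding Y as 1 - Y is not equivalent to reversing the coefficient signs.
recoded = 1.0 - cloglog_mean(0.0, 1.0, 1.0)
sign_reversed = cloglog_mean(0.0, -1.0, 1.0)
print(recoded, sign_reversed)
```

For the probit and logistic functions the last two printed values would coincide; here they differ because the Gumbel density is skewed.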

For the remainder of this chapter, we focus on the use of the logistic mean response 
function. This is currently the most widely used model for two reasons: (1) we shall 
see that the regression parameters have relatively simple and useful interpretations, and 
(2) statistical software is widely available for analysis of logistic regression models. In the 
next two sections we consider in detail the fitting of simple and multiple logistic regression 
models to binary data. 

Comment

Our development of the logistic and probit mean response functions assumed that the binary response
$Y_i$ was obtained from an explicit dichotomization of an observed continuous response $Y_i^c$, but this is
not required. These response functions often work well for binary responses that do not arise from
such a dichotomization. In addition, binary responses frequently can be interpreted as having arisen
from a dichotomization of an unobserved, or latent, continuous response.


14.3 Simple Logistic Regression 


We shall use the method of maximum likelihood to estimate the parameters of the logistic
response function. This method is well suited to deal with the problems associated with the
responses $Y_i$ being binary. As explained in Section 1.8, we first need to develop the joint
probability function of the sample observations. Instead of using the normal distribution
for the Y observations as was done earlier in (1.26), we now need to utilize the Bernoulli
distribution for a binary random variable.


Simple Logistic Regression Model 

First, we require a formal statement of the simple logistic regression model. Recall that
when the response variable is binary, taking on the values 1 and 0 with probabilities $\pi$ and
$1 - \pi$, respectively, Y is a Bernoulli random variable with parameter $E\{Y\} = \pi$. We could
state the simple logistic regression model in the usual form:

$Y_i = E\{Y_i\} + \varepsilon_i$

Since the distribution of the error term $\varepsilon_i$ depends on the Bernoulli distribution of the
response $Y_i$, it is preferable to state the simple logistic regression model in the following
fashion:

$Y_i$ are independent Bernoulli random variables with expected
values $E\{Y_i\} = \pi_i$, where:

$E\{Y_i\} = \pi_i = \dfrac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$    (14.20)

The X observations are assumed to be known constants. Alternatively, if the X observations
are random, $E\{Y_i\}$ is viewed as a conditional mean, given the value of $X_i$.





Likelihood Function 

Since each $Y_i$ observation is an ordinary Bernoulli random variable, where: 

$$P(Y_i = 1) = \pi_i$$

$$P(Y_i = 0) = 1 - \pi_i$$

we can represent its probability distribution as follows: 

$$f_i(Y_i) = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i} \qquad Y_i = 0, 1;\ i = 1, \dots, n \tag{14.21}$$

Note that $f_i(1) = \pi_i$ and $f_i(0) = 1 - \pi_i$. Hence, $f_i(Y_i)$ simply represents the probability 
that $Y_i = 1$ or 0. 

Since the $Y_i$ observations are independent, their joint probability function is: 

$$g(Y_1, \dots, Y_n) = \prod_{i=1}^{n} f_i(Y_i) = \prod_{i=1}^{n} \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i} \tag{14.22}$$

Again, it will be easier to find the maximum likelihood estimates by working with the 
logarithm of the joint probability function: 

$$\log_e g(Y_1, \dots, Y_n) = \log_e \prod_{i=1}^{n} \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}$$

$$= \sum_{i=1}^{n} [Y_i \log_e \pi_i + (1 - Y_i)\log_e(1 - \pi_i)]$$

$$= \sum_{i=1}^{n} Y_i \log_e\left(\frac{\pi_i}{1 - \pi_i}\right) + \sum_{i=1}^{n} \log_e(1 - \pi_i) \tag{14.23}$$

Since $E\{Y_i\} = \pi_i$ for a binary variable, it follows from (14.16) that: 

$$1 - \pi_i = [1 + \exp(\beta_0 + \beta_1 X_i)]^{-1} \tag{14.24}$$

Furthermore, from (14.18a), we obtain: 

$$\log_e\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i \tag{14.25}$$

Hence, (14.23) can be expressed as follows: 

$$\log_e L(\beta_0, \beta_1) = \sum_{i=1}^{n} Y_i(\beta_0 + \beta_1 X_i) - \sum_{i=1}^{n} \log_e[1 + \exp(\beta_0 + \beta_1 X_i)] \tag{14.26}$$

where $L(\beta_0, \beta_1)$ replaces $g(Y_1, \dots, Y_n)$ to show explicitly that we now view this function 
as the likelihood function of the parameters to be estimated, given the sample observations. 
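
The step from (14.23) to (14.26) can be checked numerically. The following Python sketch evaluates the log-likelihood in both forms and compares the results. The data values and the trial parameter values are hypothetical, chosen only for illustration, and the function names are our own.

```python
import math

def log_likelihood_sum_form(b0, b1, xs, ys):
    """Log-likelihood written as in (14.23), via the probabilities pi_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        pi = math.exp(b0 + b1 * x) / (1.0 + math.exp(b0 + b1 * x))  # (14.20)
        total += y * math.log(pi) + (1 - y) * math.log(1.0 - pi)
    return total

def log_likelihood_compact(b0, b1, xs, ys):
    """Log-likelihood written as in (14.26), directly in beta_0 and beta_1."""
    return sum(y * (b0 + b1 * x) - math.log(1.0 + math.exp(b0 + b1 * x))
               for x, y in zip(xs, ys))

# Hypothetical illustration data and trial parameter values (not from the text).
xs = [2.0, 5.0, 8.0, 11.0]
ys = [0, 0, 1, 1]
trial_b0, trial_b1 = -3.0, 0.5

ll_a = log_likelihood_sum_form(trial_b0, trial_b1, xs, ys)
ll_b = log_likelihood_compact(trial_b0, trial_b1, xs, ys)
print(ll_a, ll_b)  # the two forms should agree up to rounding error
```

A numerical search procedure would repeatedly evaluate a function such as `log_likelihood_compact` at different trial values and keep the pair that makes it largest.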

Maximum Likelihood Estimation 

The maximum likelihood estimates of $\beta_0$ and $\beta_1$ in the simple logistic regression model 
are those values of $\beta_0$ and $\beta_1$ that maximize the log-likelihood function in (14.26). No 
closed-form solution exists for the values of $\beta_0$ and $\beta_1$ in (14.26) that maximize the log- 
likelihood function. Computer-intensive numerical search procedures are therefore required 
to find the maximum likelihood estimates $b_0$ and $b_1$. There are several widely used numerical 
search procedures; one of these employs iteratively reweighted least squares, which we shall 
explain in Section 14.4. Reference 14.1 provides a discussion of several numerical search 
procedures for finding maximum likelihood estimates. We shall rely on standard statistical 
software programs specifically designed for logistic regression to obtain the maximum 
likelihood estimates $b_0$ and $b_1$. 

Once the maximum likelihood estimates $b_0$ and $b_1$ are found, we substitute these values 
into the response function in (14.20) to obtain the fitted response function. We shall use $\hat{\pi}_i$ 
to denote the fitted value for the ith case: 

$$\hat{\pi}_i = \frac{\exp(b_0 + b_1 X_i)}{1 + \exp(b_0 + b_1 X_i)} \tag{14.27}$$

The fitted logistic response function is as follows: 

$$\hat{\pi} = \frac{\exp(b_0 + b_1 X)}{1 + \exp(b_0 + b_1 X)} \tag{14.28}$$

If we utilize the logit transformation in (14.18), we can express the fitted response 
function in (14.28) as follows: 

$$\hat{\pi}' = b_0 + b_1 X \tag{14.29}$$

where: 

$$\hat{\pi}' = \log_e\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) \tag{14.29a}$$

We call (14.29) the fitted logit response function. 

Once the fitted logistic response function has been obtained, the usual next steps are to 
examine the appropriateness of the fitted response function and, if the fit is good, to make a 
variety of inferences and predictions. We shall postpone a discussion of how to examine the 
goodness of fit of a logistic response function and how to make inferences and predictions 
until we have considered the multiple logistic regression model with a number of predictor 
variables. 


A systems analyst studied the effect of computer programming experience on ability to 
complete within a specified time a complex programming task, including debugging. 
Twenty-five persons were selected for the study. They had varying amounts of programming 
experience (measured in months of experience), as shown in Table 14.1a, column 1. All 
persons were given the same programming task, and the results of their success in the task 
are shown in column 2. The results are coded in binary fashion: Y = 1 if the task was 
completed successfully in the allotted time, and Y = 0 if the task was not completed successfully. 
Figure 14.5 contains a scatter plot of the data. This plot is not too informative because of the 
nature of the response variable, other than to indicate that ability to complete the task 
successfully appears to increase with amount of experience. A lowess nonparametric response 
curve was fitted to the data and is also shown in Figure 14.5. A sigmoidal S-shaped response 
function is clearly suggested by the nonparametric lowess fit. It was therefore decided to fit 
the logistic regression model (14.20). 

A standard logistic regression package was run on the data. The results are contained 
in Table 14.1b. Since $b_0 = -3.0597$ and $b_1 = .1615$, the estimated logistic regression 
function (14.28) is: 

$$\hat{\pi} = \frac{\exp(-3.0597 + .1615X)}{1 + \exp(-3.0597 + .1615X)} \tag{14.30}$$

The fitted values are given in Table 14.1a, column 3. For instance, the estimated mean 
response for $i = 1$, where $X_1 = 14$, is: 

$$\hat{\pi}_1 = \frac{\exp[-3.0597 + .1615(14)]}{1 + \exp[-3.0597 + .1615(14)]} = .310$$

FIGURE 14.5 Scatter Plot, Lowess Curve (dashed line), and Estimated Logistic Mean Response 
Function (solid line): Programming Task Example. 


TABLE 14.1 Data and Maximum Likelihood Estimates: Programming Task Example. 

(a) Data 

            (1)           (2)        (3) 
            Months of     Task       Fitted 
Person      Experience    Success    Value 
$i$         $X_i$         $Y_i$      $\hat{\pi}_i$ 
1           14            0          .310 
2           29            0          .835 
3           6             0          .110 
...         ...           ...        ... 
23          28            1          .812 
24          22            1          .621 
25          8             1          .146 

(b) Maximum Likelihood Estimates 

Regression     Estimated Regression    Estimated Standard 
Coefficient    Coefficient             Deviation 
$\beta_0$      -3.0597                 1.259 
$\beta_1$      .1615                   .0650 




This fitted value is the estimated probability that a person with 14 months experience will 
successfully complete the programming task. In addition to the lowess fit, Figure 14.5 also 
contains a plot of the fitted logistic response function, $\hat{\pi}(X)$. 
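
As a minimal sketch, the fitted values can be reproduced from the estimates in Table 14.1b. The Python code below evaluates the fitted response function for the six cases displayed in Table 14.1a; the function and variable names are our own, and the remaining 19 cases are omitted because they are not displayed in the table.

```python
import math

B0, B1 = -3.0597, 0.1615  # maximum likelihood estimates from Table 14.1b

def fitted_logit(x, b0=B0, b1=B1):
    """Fitted logit response function (14.29)."""
    return b0 + b1 * x

def fitted_probability(x, b0=B0, b1=B1):
    """Fitted logistic response function (14.28)."""
    eta = fitted_logit(x, b0, b1)
    return math.exp(eta) / (1.0 + math.exp(eta))

# Months of experience and tabled fitted values for the cases shown in Table 14.1a
months = [14, 29, 6, 28, 22, 8]
tabled = [0.310, 0.835, 0.110, 0.812, 0.621, 0.146]

computed = [fitted_probability(x) for x in months]
for x, p, t in zip(months, computed, tabled):
    print(f"X = {x:2d}  fitted = {p:.3f}  table = {t:.3f}")
```

Small differences in the last digit can arise because the tabled coefficients are themselves rounded.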


Interpretation of $b_1$ 

The interpretation of the estimated regression coefficient $b_1$ in the fitted logistic response 
function (14.30) is not the straightforward interpretation of the slope in a linear regression 
model. The reason is that the effect of a unit increase in X varies for the logistic regression 
model according to the location of the starting point on the X scale. An interpretation of $b_1$ 
is found in the property of the fitted logistic function that the estimated odds $\hat{\pi}/(1 - \hat{\pi})$ are 
multiplied by $\exp(b_1)$ for any unit increase in X. 

To see this, we consider the value of the fitted logit response function (14.29) at $X = X_j$: 

$$\hat{\pi}'(X_j) = b_0 + b_1 X_j$$

The notation $\hat{\pi}'(X_j)$ indicates specifically the X level associated with the fitted value. We 
also consider the value of the fitted logit response function at $X = X_j + 1$: 

$$\hat{\pi}'(X_j + 1) = b_0 + b_1(X_j + 1)$$

The difference between the two fitted values is simply: 

$$\hat{\pi}'(X_j + 1) - \hat{\pi}'(X_j) = b_1$$


Now according to (14.29a), $\hat{\pi}'(X_j)$ is the logarithm of the estimated odds when $X = X_j$; 
we shall denote it by $\log_e(\text{odds}_1)$. Similarly, $\hat{\pi}'(X_j + 1)$ is the logarithm of the estimated 
odds when $X = X_j + 1$; we shall denote it by $\log_e(\text{odds}_2)$. Hence, the difference between 
the two fitted logit response values can be expressed as follows: 

$$\log_e(\text{odds}_2) - \log_e(\text{odds}_1) = \log_e\left(\frac{\text{odds}_2}{\text{odds}_1}\right) = b_1$$

Taking antilogs of each side, we see that the estimated ratio of the odds, called the odds 
ratio and denoted by $\widehat{OR}$, equals $\exp(b_1)$: 

$$\widehat{OR} = \frac{\text{odds}_2}{\text{odds}_1} = \exp(b_1) \tag{14.31}$$

For the programming task example, we see from Figure 14.5 that the probability of success 
increases sharply with experience. Specifically, Table 14.1b shows that the odds ratio is 
$\widehat{OR} = \exp(b_1) = \exp(.1615) = 1.175$, so that the odds of completing the task increase by 
17.5 percent with each additional month of experience. 

Since a unit increase of one month is quite small, the estimated odds ratio of 1.175 may not 
adequately show the change in odds for a longer difference in time. In general, the estimated 
odds ratio when there is a difference of c units of X is $\exp(cb_1)$. For example, should we wish 
to compare individuals with relatively little experience to those with extensive experience, 
say 10 months versus 25 months so that c = 15, then the odds ratio would be estimated 
to be $\exp[15(.1615)] = 11.3$. This indicates that the odds of completing the task increase 
over 11-fold for experienced persons compared to relatively inexperienced persons. 
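
The odds ratio calculations can be verified with a few lines of Python. This sketch uses only the estimates $b_0 = -3.0597$ and $b_1 = .1615$ from Table 14.1b; the helper name `estimated_odds` is our own. It also illustrates that the ratio of odds for a unit increase is the same at every starting level of X, even though the change in the fitted probability is not.

```python
import math

b0, b1 = -3.0597, 0.1615  # estimates from Table 14.1b

def estimated_odds(x):
    """Estimated odds pi/(1 - pi), which equal exp(b0 + b1*x) under (14.29)."""
    return math.exp(b0 + b1 * x)

odds_ratio_one_month = math.exp(b1)       # (14.31), about 1.175
odds_ratio_15_months = math.exp(15 * b1)  # exp(c*b1) with c = 15, about 11.3

# The unit-increase odds ratio does not depend on the starting point on the X scale.
ratio_at_10 = estimated_odds(11) / estimated_odds(10)
ratio_at_20 = estimated_odds(21) / estimated_odds(20)

print(round(odds_ratio_one_month, 3), round(odds_ratio_15_months, 1))
print(ratio_at_10, ratio_at_20)
```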



FIGURE 14.6 Logistic (solid line), Probit (dashed line), and Complementary Log-Log (gray line) 
Fits: Programming Task Example. [Horizontal axis: Experience (X)] 


Comment 

The odds ratio interpretation of the estimated regression coefficient $b_1$ makes the logistic regression 
model especially attractive for modeling and interpreting epidemiologic studies. ■ 

Use of Probit and Complementary Log-Log Response Functions 

As we discussed earlier in Section 14.2, alternative sigmoidal shaped response functions, 
such as the probit or complementary log-log functions, can be utilized as well. For example, 
it is interesting to fit the programming task data in Table 14.1 to these alternative response 
functions. Figure 14.6 shows the scatter plot of the data and the fitted logistic, probit, 
and complementary log-log mean response functions. The logistic and probit fits are very 
similar, whereas the complementary log-log fit differs slightly, having a less pronounced 
S-shape. 

Repeat Observations — Binomial Outcomes 

In some cases, particularly for designed experiments, a number of repeat observations are 
obtained at several levels of the predictor variable X. For instance, a pricing experiment 
involved showing a new product to 1,000 consumers, providing information about it, and 
then asking each consumer whether he or she would buy the product at a given price. 
Five prices were studied, and 200 persons were randomly selected for each price level. 
The response variable here is binary (would purchase, would not purchase); the predictor 
variable is price and has five levels. 

When repeat observations are present, the log-likelihood function in (14.26) can be 
simplified. We shall adopt the notation used for replicate observations in our discussion of 
the F test for lack of fit in Section 3.7. We denote the X levels at which repeat observations are 
obtained by $X_1, \dots, X_c$ and we assume that there are $n_j$ binary responses at level $X_j$. Then 
the observed value of the ith binary response at $X_j$ is denoted by $Y_{ij}$, where $i = 1, \dots, n_j$, 
and $j = 1, \dots, c$. The number of 1s at level $X_j$ is denoted by $Y_{.j}$: 

$$Y_{.j} = \sum_{i=1}^{n_j} Y_{ij} \tag{14.32a}$$

and the proportion of 1s at level $X_j$ is denoted by $p_j$: 

$$p_j = \frac{Y_{.j}}{n_j} \tag{14.32b}$$

The random variable $Y_{.j}$ has a binomial distribution given by: 

$$f(Y_{.j}) = \binom{n_j}{Y_{.j}} \pi_j^{Y_{.j}} (1 - \pi_j)^{n_j - Y_{.j}} \tag{14.33}$$

where: 

$$\binom{n_j}{Y_{.j}} = \frac{n_j!}{Y_{.j}!\,(n_j - Y_{.j})!}$$

and the factorial notation $a!$ represents $a(a - 1)(a - 2) \cdots 1$. The binomial random variable 
$Y_{.j}$ has mean $n_j \pi_j$ and variance $n_j \pi_j (1 - \pi_j)$. The log-likelihood function then can be 
stated as follows: 

$$\log_e L(\beta_0, \beta_1) = \sum_{j=1}^{c} \left\{ \log_e \binom{n_j}{Y_{.j}} + Y_{.j}(\beta_0 + \beta_1 X_j) - n_j \log_e[1 + \exp(\beta_0 + \beta_1 X_j)] \right\} \tag{14.34}$$

Example 


In a study of the effectiveness of coupons offering a price reduction on a given product, 
1,000 homes were selected at random. A packet containing advertising material and a 
coupon for the product was mailed to each home. The coupons offered different price 
reductions (5, 10, 15, 20, and 30 dollars), and 200 homes were assigned at random to each 
of the price reduction categories. The predictor variable X in this study is the amount of 
price reduction, and the response variable Y is a binary variable indicating whether or not 
the coupon was redeemed within a six-month period. 

Table 14.2 contains the data for this study. $X_j$ denotes the price reduction offered by 
a coupon, $n_j$ the number of households that received a coupon with price reduction $X_j$, 
$Y_{.j}$ the number of these households that redeemed the coupon, and $p_j$ the proportion of 
households receiving a coupon with price reduction $X_j$ that redeemed the coupon. The 
logistic regression model (14.20) was fitted by a logistic regression package and the fitted 
response function was found to be: 

$$\hat{\pi} = \frac{\exp(-2.04435 + .096834X)}{1 + \exp(-2.04435 + .096834X)} \tag{14.35}$$

TABLE 14.2 Data: Coupon Effectiveness Example. 

          (1)          (2)           (3)          (4)              (5) 
                                     Number of    Proportion of    Model- 
          Price        Number of     Coupons      Coupons          Based 
Level     Reduction    Households    Redeemed     Redeemed         Estimate 
$j$       $X_j$        $n_j$         $Y_{.j}$     $p_j$            $\hat{\pi}_j$ 
1         5            200           30           .150             .1736 
2         10           200           55           .275             .2543 
3         15           200           70           .350             .3562 
4         20           200           100          .500             .4731 
5         30           200           137          .685             .7028 

FIGURE 14.7 Plot of Proportions of Coupons Redeemed and Fitted Logistic Response Function: 
Coupon Effectiveness Example. [Horizontal axis: Price Reduction ($)] 


Fitted values are given in column 5 of Table 14.2. Figure 14.7 shows the fitted response 
function, as well as the proportions of coupons redeemed at each of the $X_j$ levels. The 
logistic response function appears to provide a very good fit. The odds ratio here is: 

$$\widehat{OR} = \exp(b_1) = \exp(.096834) = 1.102$$

Hence, the odds of a coupon being redeemed are estimated to increase by 10.2 percent with 
each one dollar increase in the coupon value, that is, with each one dollar reduction in price. 
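
As a brief sketch, the model-based estimates in column 5 of Table 14.2 and the odds ratio can be recomputed from the fitted response function (14.35). The Python code below uses only the tabled data and the reported coefficients; the variable names are our own.

```python
import math

b0, b1 = -2.04435, 0.096834            # coefficients of the fitted function (14.35)

price_reduction = [5, 10, 15, 20, 30]  # X_j, Table 14.2 column 1
households = [200, 200, 200, 200, 200] # n_j, column 2
redeemed = [30, 55, 70, 100, 137]      # Y_.j, column 3
tabled_fit = [0.1736, 0.2543, 0.3562, 0.4731, 0.7028]  # column 5

# Observed proportions p_j = Y_.j / n_j, as in (14.32b)
proportions = [y / n for y, n in zip(redeemed, households)]

# Model-based estimates from (14.35)
fitted = [math.exp(b0 + b1 * x) / (1.0 + math.exp(b0 + b1 * x))
          for x in price_reduction]

odds_ratio = math.exp(b1)

for x, p, f in zip(price_reduction, proportions, fitted):
    print(f"X = {x:2d}  p_j = {p:.3f}  fitted = {f:.4f}")
print(round(odds_ratio, 3))
```

Comparing the observed proportions with the fitted values level by level gives the same impression as Figure 14.7: the two agree closely at every price reduction.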


14.4 Multiple Logistic Regression 


Multiple Logistic Regression Model 

The simple logistic regression model (14.20) is easily extended to more than one predictor 
variable. In fact, several predictor variables are usually required with logistic regression to 
obtain adequate description and useful predictions. 

In extending the simple logistic regression model, we simply replace $\beta_0 + \beta_1 X$ in (14.16) 
by $\beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}$. To simplify the formulas, we shall use matrix notation 
and the following three vectors: 

$$\boldsymbol{\beta}_{p \times 1} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix} \qquad \mathbf{X}_{p \times 1} = \begin{bmatrix} 1 \\ X_1 \\ \vdots \\ X_{p-1} \end{bmatrix} \qquad \mathbf{X}_{i\,(p \times 1)} = \begin{bmatrix} 1 \\ X_{i1} \\ \vdots \\ X_{i,p-1} \end{bmatrix} \tag{14.36}$$

We then have: 

$$\mathbf{X}'\boldsymbol{\beta} = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1} \tag{14.37a}$$

$$\mathbf{X}_i'\boldsymbol{\beta} = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} \tag{14.37b}$$

With this notation, the simple logistic response function (14.20) extends to the multiple 
logistic response function as follows: 

$$E\{Y\} = \frac{\exp(\mathbf{X}'\boldsymbol{\beta})}{1 + \exp(\mathbf{X}'\boldsymbol{\beta})} \tag{14.38}$$

and the equivalent simple logistic response form (14.17) extends to: 

$$E\{Y\} = [1 + \exp(-\mathbf{X}'\boldsymbol{\beta})]^{-1} \tag{14.38a}$$

Similarly, the logit transformation (14.18a): 

$$\pi' = \log_e\left(\frac{\pi}{1 - \pi}\right) \tag{14.39}$$

now leads to the logit response function, or linear predictor: 

$$\pi' = \mathbf{X}'\boldsymbol{\beta} \tag{14.40}$$

The multiple logistic regression model can therefore be stated as follows: 

$Y_i$ are independent Bernoulli random variables with expected 
values $E\{Y_i\} = \pi_i$, where: 

$$E\{Y_i\} = \pi_i = \frac{\exp(\mathbf{X}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i'\boldsymbol{\beta})} \tag{14.41}$$

Again, the X observations are considered to be known constants. Alternatively, if the X 
variables are random, $E\{Y_i\}$ is viewed as a conditional mean, given the values of $X_{i1}, \dots, X_{i,p-1}$. 

Like the simple logistic response function (14.16), the multiple logistic response 
function (14.41) is monotonic and sigmoidal in shape with respect to $\mathbf{X}'\boldsymbol{\beta}$ and is almost linear 
when $\pi$ is between .2 and .8. The X variables may be different predictor variables, or 
some may represent curvature and/or interaction effects. Also, the predictor variables may 
be quantitative, or they may be qualitative and represented by indicator variables. This 
flexibility makes the multiple logistic regression model very attractive. 

Comment 

When the logistic regression model contains only qualitative variables, it is often referred to as 
a log-linear model. See Reference 14.2 for an in-depth discussion of the analysis of log-linear 
models. ■ 

Fitting of Model 

Again, we shall utilize the method of maximum likelihood to estimate the parameters of the 
multiple logistic response function (14.41). The log-likelihood function for simple logistic 
regression in (14.26) extends directly for multiple logistic regression: 

$$\log_e L(\boldsymbol{\beta}) = \sum_{i=1}^{n} Y_i(\mathbf{X}_i'\boldsymbol{\beta}) - \sum_{i=1}^{n} \log_e[1 + \exp(\mathbf{X}_i'\boldsymbol{\beta})] \tag{14.42}$$





FIGURE 14.8 Three-Dimensional Fitted Logistic Response Surface: Coronary Heart Disease Example. 


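Numerical search procedures are used to find the values of $\beta_0, \beta_1, \dots, \beta_{p-1}$ that maximize 
$\log_e L(\boldsymbol{\beta})$. These maximum likelihood estimates will be denoted by $b_0, b_1, \dots, b_{p-1}$. Let $\mathbf{b}$ 
denote the vector of the maximum likelihood estimates: 

$$\mathbf{b}_{p \times 1} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{bmatrix} \tag{14.43}$$

The fitted logistic response function and fitted values can then be expressed as follows: 

$$\hat{\pi} = \frac{\exp(\mathbf{X}'\mathbf{b})}{1 + \exp(\mathbf{X}'\mathbf{b})} = [1 + \exp(-\mathbf{X}'\mathbf{b})]^{-1} \tag{14.44a}$$

$$\hat{\pi}_i = \frac{\exp(\mathbf{X}_i'\mathbf{b})}{1 + \exp(\mathbf{X}_i'\mathbf{b})} = [1 + \exp(-\mathbf{X}_i'\mathbf{b})]^{-1} \tag{14.44b}$$

where: 

$$\mathbf{X}'\mathbf{b} = b_0 + b_1 X_1 + \cdots + b_{p-1} X_{p-1} \tag{14.44c}$$

$$\mathbf{X}_i'\mathbf{b} = b_0 + b_1 X_{i1} + \cdots + b_{p-1} X_{i,p-1} \tag{14.44d}$$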

Geometric interpretation. Recall that when fitting a standard multiple regression model 
with two predictors, the estimated regression surface is a plane in three-dimensional space, 
as shown in Figure 6.7 on page 240 for the Dwaine Studios example. A multiple logistic 
regression fit based on two continuous predictors can also be represented by a surface in 
three-dimensional space, but the surface follows the characteristic S-shape that we saw 
for simple logistic models. For example, Figure 14.8 displays a three-dimensional plot of a 
logistic response function that depicts the relationship between the development of coronary 
disease (Y, the binary outcome) and two continuous predictors, cholesterol level ($X_1$) and 
age ($X_2$). This surface increases in an approximately linear fashion for larger values of 
cholesterol level and age, but levels off and is nearly horizontal for small values of these 
predictors. 

We shall rely on standard statistical packages for logistic regression to conduct the 
numerical search procedures for obtaining the maximum likelihood estimates. We therefore 
proceed directly to an example to illustrate the fitting and interpretation of a multiple logistic 
regression model. 

In a health study to investigate an epidemic outbreak of a disease that is spread by mosquitoes, 
individuals were randomly sampled within two sectors in a city to determine if the person 
had recently contracted the disease under study. This was ascertained by the interviewer, 
who asked pertinent questions to assess whether certain specific symptoms associated with 
the disease were present during the specified period. The response variable Y was coded 1 
if this disease was determined to have been present, and 0 if not. 

Three predictor variables were included in the study, representing known or potential 
risk factors. They are age, socioeconomic status of household, and sector within city. Age 
($X_1$) is a quantitative variable. Socioeconomic status is a categorical variable with three 
levels. It is represented by two indicator variables ($X_2$ and $X_3$), as follows: 

Class      $X_2$    $X_3$ 
Upper      0        0 
Middle     1        0 
Lower      0        1 

City sector is also a categorical variable. Since there were only two sectors in the study, 
one indicator variable ($X_4$) was used, defined so that $X_4 = 0$ for sector 1 and $X_4 = 1$ for 
sector 2. 

The reason why the upper socioeconomic class was chosen as the reference class 
(i.e., the class for which the indicator variables $X_2$ and $X_3$ are coded 0) is that it was expected 
that this class would have the lowest disease rate among the socioeconomic classes. By making 
this class the reference class, the odds ratios associated with regression coefficients $\beta_2$ 
and $\beta_3$ would then be expected to be greater than 1, facilitating their interpretation. For the 
same reason, sector 1, where the epidemic was less severe, was chosen as the reference 
class for the sector indicator variable $X_4$. 

The data for 196 individuals in the sample are given in the disease outbreak data set in 
Appendix C.10. The first 98 cases were selected for fitting the model. The remaining 98 
cases were saved to serve as a validation data set. Table 14.3 in columns 1-5 contains the 
data for a portion of the 98 cases used for fitting the model. Note the use of the indicator 
variables as just explained for the two categorical variables. The primary purpose of the 
study was to assess the strength of the association between each of the predictor variables 
and the probability of a person having contracted the disease. 

A first-order multiple logistic regression model with the three predictor variables was 
considered a priori to be reasonable: 


$$E\{Y\} = [1 + \exp(-\mathbf{X}'\boldsymbol{\beta})]^{-1} \tag{14.45}$$

where: 

$$\mathbf{X}'\boldsymbol{\beta} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 \tag{14.45a}$$

TABLE 14.4 Maximum Likelihood Estimates of Logistic Regression Function (14.45): Disease 
Outbreak Example. 

(a) Estimated Coefficients, Standard Deviations, and Odds Ratios 

Regression     Estimated Regression    Estimated Standard    Estimated 
Coefficient    Coefficient             Deviation             Odds Ratio 
$\beta_0$      -2.3129                 .6426                 - 
$\beta_1$      .02975                  .01350                1.030 
$\beta_2$      .4088                   .5990                 1.505 
$\beta_3$      -.30525                 .6041                 .737 
$\beta_4$      1.5747                  .5016                 4.829 

(b) Estimated Approximate Variance-Covariance Matrix $\mathbf{s}^2\{\mathbf{b}\}$ 

           $b_0$     $b_1$     $b_2$     $b_3$     $b_4$ 
$b_0$      .4129     -.0057    -.1836    -.2010    -.1632 
$b_1$      -.0057    .00018    .00115    .00073    .00034 
$b_2$      -.1836    .00115    .3588     .1482     .0129 
$b_3$      -.2010    .00073    .1482     .3650     .0623 
$b_4$      -.1632    .00034    .0129     .0623     .2516 


This model was fitted by the method of maximum likelihood to the data for the 98 cases. 
The results are summarized in Table 14.4a. The estimated logistic response function is: 

$$\hat{\pi} = [1 + \exp(2.3129 - .02975X_1 - .4088X_2 + .30525X_3 - 1.5747X_4)]^{-1} \tag{14.46}$$

The interpretation of the estimated regression coefficients in the fitted first-order multiple 
logistic response function parallels that for the simple logistic response function: $\exp(b_k)$ 
is the estimated odds ratio for predictor variable $X_k$. The only difference in interpretation 
for multiple logistic regression is that the estimated odds ratio for predictor variable $X_k$ 
assumes that all other predictor variables are held constant. The levels at which they are 
held constant do not matter in a first-order model. We see from Table 14.4a, for instance, 
that the odds of a person having contracted the disease increase by about 3.0 percent with 
each additional year of age ($X_1$), for given socioeconomic status and city sector location. 
Also, the odds of a person in sector 2 ($X_4 = 1$) having contracted the disease are almost five 
times as great as for a person in sector 1, for given age and socioeconomic status. These are 
point estimates, to be sure, and we shall need to consider how precise these estimates are. 

TABLE 14.3 Portion of Model-Building Data Set: Disease Outbreak Example. (Coded) 

          (1)         (2)         (3)         (4)         (5)        (6) 
                      Socioeconomic Status    City        Disease    Fitted 
Case      Age                                 Sector      Status     Value 
$i$       $X_{i1}$    $X_{i2}$    $X_{i3}$    $X_{i4}$    $Y_i$      $\hat{\pi}_i$ 
1         33          0           0           0           0          .209 
2         35          0           0           0           0          .219 
3         6           0           0           0           0          .106 
4         60          0           0           0           0          .371 
5         18          0           1           0           1          .111 
6         26          0           1           0           0          .136 
...       ...         ...         ...         ...         ...        ... 
98        35          0           1           0           0          .171 

Table 14.3, column 6, contains the fitted values $\hat{\pi}_i$. These are calculated as usual. For 
instance, the estimated mean response for case $i = 1$, where $X_{11} = 33$, $X_{12} = 0$, $X_{13} = 0$, 
$X_{14} = 0$, is: 

$$\hat{\pi}_1 = \{1 + \exp[2.3129 - .02975(33) - .4088(0) + .30525(0) - 1.5747(0)]\}^{-1} = .209$$
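
The same calculation can be sketched for every case displayed in Table 14.3. The Python code below uses the coefficients that appear in the worked computation for case 1 (intercept $-2.3129$, followed by $.02975$, $.4088$, $-.30525$, and $1.5747$) and compares the results with column 6 of the table. The function and variable names are our own.

```python
import math

# Estimated coefficients b0, b1, ..., b4, as used in the computation for case 1
b = [-2.3129, 0.02975, 0.4088, -0.30525, 1.5747]

def fitted_probability(age, x2, x3, x4):
    """Fitted value [1 + exp(-X'b)]^(-1), as in (14.44b)."""
    linear_predictor = b[0] + b[1] * age + b[2] * x2 + b[3] * x3 + b[4] * x4
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Cases shown in Table 14.3: case -> (age, X2, X3, X4), and the tabled fitted value
cases = {1: (33, 0, 0, 0), 2: (35, 0, 0, 0), 3: (6, 0, 0, 0), 4: (60, 0, 0, 0),
         5: (18, 0, 1, 0), 6: (26, 0, 1, 0), 98: (35, 0, 1, 0)}
tabled = {1: 0.209, 2: 0.219, 3: 0.106, 4: 0.371, 5: 0.111, 6: 0.136, 98: 0.171}

computed = {case: fitted_probability(*x) for case, x in cases.items()}

# Estimated odds ratios exp(b_k) for X1, ..., X4
odds_ratios = [math.exp(bk) for bk in b[1:]]

for case in cases:
    print(case, round(computed[case], 3), tabled[case])
print([round(r, 3) for r in odds_ratios])
```

Differences in the third decimal place can occur because the coefficients are rounded before the fitted values are recomputed.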

Polynomial Logistic Regression 

Occasionally, the first-order logistic model may not provide an adequate fit to the data and 
a more complicated model may be needed. One such model is the kth-order polynomial 
logistic regression model with logit response function: 

$$\pi'(x) = \beta_0 + \beta_{11} x + \beta_{22} x^2 + \cdots + \beta_{kk} x^k \tag{14.47}$$

where $x$ denotes the centered predictor, $X - \bar{X}$. This model for the logit is still linear in the 
$\beta$ parameters. For simplicity, we will use a second-order polynomial: 

$$\pi'(x) = \beta_0 + \beta_{11} x + \beta_{22} x^2$$

to demonstrate the procedure. 

A study of 482 initial public offering companies (IPOs) was conducted to determine the 
characteristics of companies that attract venture capital. Here, the response of interest 
is whether or not the company was financed by venture capital funds. Several potential 
predictors are: the face value of the company; the number of shares offered; and whether 
or not the company was a leveraged buyout. The IPO data set is listed in Appendix C.11. 
In this example we consider just one predictor, the face value of the company. 

Figure 14.9a contains a plot of venture capital involvement (Y) versus the natural 
logarithm of the face value of the company (X) with a lowess smooth and the 
first-order logistic regression fit superimposed. (Here we chose to analyze the natural 
logarithm of face value because face value ranges over several orders of magnitude, with a highly 
skewed distribution.) The lowess smooth clearly suggests a mound-shaped relationship: for 
small and large companies, the likelihood of venture capital involvement is near zero, but for 
midsized companies it is over .5. The first-order logistic regression fit is unable to capture 
the characteristic mound shape of the mean response function and is clearly inadequate. 
Table 14.5 shows the fitted second-order response function: 

$$\hat{\pi}' = .3005 + .5516x - .8615x^2$$

where $x = X - \bar{X}$. Also shown in Table 14.5 are three quantities to be discussed in 
Section 14.5, namely, the estimated standard error of each coefficient, a statistic, $z^*$, for testing 
the hypothesis that the coefficient is zero, and the resulting P-value. We simply note for now 
that the P-value for $b_{22}$ is .000, confirming the need for a second-order term. Figure 14.9b 
plots the data, the lowess smooth, and the second-order polynomial logistic regression fit. 
Note that the second-order polynomial fit tracks the lowess smooth closely. 

FIGURE 14.9 (a) First-Order Fit; (b) Second-Order Fit. [Horizontal axis in both panels: 
Logarithm of Face Value] 

TABLE 14.5 Logistic Regression Output for Second-Order Model: IPO Example. 

              Estimated             Estimated 
Predictor     Coefficient           Standard Error    $z^*$     P-value 
Constant      $b_0 = 0.3005$        0.1240            2.42      0.015 
$x$           $b_{11} = 0.5516$     0.1385            3.98      0.000 
$x^2$         $b_{22} = -0.8615$    0.1404            -6.14     0.000 
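
To see how a negative second-order coefficient produces the mound shape, the fitted second-order logit can be evaluated over a grid of centered values and converted to probabilities. This Python sketch uses the coefficients in Table 14.5; the grid of $x$ values is hypothetical and chosen only for illustration, and the names are our own.

```python
import math

b0, b11, b22 = 0.3005, 0.5516, -0.8615  # estimated coefficients from Table 14.5

def fitted_logit(x):
    """Fitted second-order logit, where x is the centered predictor X - Xbar."""
    return b0 + b11 * x + b22 * x * x

def fitted_probability(x):
    """Fitted probability obtained by inverting the logit."""
    return 1.0 / (1.0 + math.exp(-fitted_logit(x)))

# A quadratic with b22 < 0 peaks where its derivative b11 + 2*b22*x equals zero.
x_peak = -b11 / (2.0 * b22)

# Hypothetical grid of centered values, for illustration only.
for x in [-3, -2, -1, 0, 1, 2, 3]:
    print(f"x = {x:+d}  logit = {fitted_logit(x):8.3f}  pi = {fitted_probability(x):.3f}")
print(round(x_peak, 3), round(fitted_probability(x_peak), 3))
```

The fitted probability exceeds .5 near the peak and falls toward zero as $x$ moves away from it in either direction, in line with the lowess smooth described above.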

The above example demonstrated the use of polynomial regression for a single predictor. 
For multiple logistic regression, higher order polynomial terms and cross-products may be 
added to improve the fit of a model, as discussed in Section 8.1 in the context of multiple 
linear regression models. 


Comments 

1. The maximum likelihood estimates of the parameters $\boldsymbol{\beta}$ for the logistic regression model can 
be obtained by iteratively reweighted least squares. The procedure is straightforward, although it 
involves intensive use of a computer. 

a. Obtain starting values for the regression parameters, to be denoted by $\mathbf{b}(0)$. Often, reasonable 
starting values can be obtained by ordinary least squares regression of Y on the predictor variables 
$X_1, \dots, X_{p-1}$, using a first-order linear model. 

b. Using these starting values, obtain: 

$$\hat{\pi}_i'(0) = \mathbf{X}_i'[\mathbf{b}(0)] \tag{14.48a}$$

$$\hat{\pi}_i(0) = \frac{\exp[\hat{\pi}_i'(0)]}{1 + \exp[\hat{\pi}_i'(0)]} \tag{14.48b}$$

c. Calculate the new response variable: 

$$Y_i'(0) = \hat{\pi}_i'(0) + \frac{Y_i - \hat{\pi}_i(0)}{\hat{\pi}_i(0)[1 - \hat{\pi}_i(0)]} \tag{14.49a}$$

and the weights: 

$$w_i(0) = \hat{\pi}_i(0)[1 - \hat{\pi}_i(0)] \tag{14.49b}$$

d. Regress $Y'(0)$ in (14.49a) on the predictor variables $X_1, \dots, X_{p-1}$ using a first-order linear 
model with weights in (14.49b) to obtain revised estimated regression coefficients, denoted by $\mathbf{b}(1)$. 

e. Repeat steps b through d, making revisions in (14.48) and (14.49) by using the latest revised 
estimated regression coefficients until there is little if any change in the estimated coefficients. Often 
three or four iterations are sufficient to obtain convergence. 

2. When the multiple logistic regression model is not a first-order model and contains quadratic 
or higher-power terms for the predictor variables and/or cross-product terms for interaction effects, 
the estimated regression coefficients $b_k$ no longer have a simple interpretation. 

3. When the assumptions of a monotonic sigmoidal relation between $\pi$ and $\mathbf{X}'\boldsymbol{\beta}$, required for 
the multiple logistic regression model, are not appropriate, an alternative is to convert all predictor 
variables to categorical variables and employ a log-linear model. In the disease outbreak example, 
for instance, age could be converted into a categorical variable with three classes: 0-18, 19-50, and 
51-75. Reference 14.2 describes the use of log-linear models for binary response variables when the 
predictor variables are categorical. 

4. Convergence difficulties in the numerical search procedures for finding the maximum likelihood 
estimates of the multiple logistic regression function may be encountered when the predictor variables 
are highly correlated or when there is a large number of predictor variables. Another instance that 
causes convergence problems occurs when a collection of the predictors either completely or nearly 
perfectly separates the outcome groups. Indication of this problem often can be detected by noting large 
estimated parameters and large estimated standard errors, similar to what occurs with multicollinearity 
problems. When convergence problems occur, it may be necessary to reduce the number of predictor 
variables in order to obtain convergence. ■ 
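
The iteratively reweighted least squares steps a through e in Comment 1 can be sketched in Python for the special case of one predictor, where each weighted regression reduces to solving two normal equations. This is a minimal illustration under stated assumptions, not the algorithm of any particular package: the function names, the convergence tolerance, and the iteration cap are our own choices. The data are the coupon effectiveness data of Table 14.2, expanded into 1,000 individual binary responses.

```python
import math

def weighted_line_fit(x, z, w):
    """Weighted least squares regression of z on x with an intercept; returns (b0, b1)."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swz = sum(wi * zi for wi, zi in zip(w, z))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    b1 = (sw * swxz - swx * swz) / (sw * swxx - swx ** 2)
    b0 = (swz - b1 * swx) / sw
    return b0, b1

def irls_simple_logistic(x, y, tol=1e-10, max_iter=50):
    # Step a: starting values b(0) from ordinary least squares of Y on X.
    b0, b1 = weighted_line_fit(x, y, [1.0] * len(x))
    for _ in range(max_iter):
        # Step b: current logits (14.48a) and probabilities (14.48b).
        eta = [b0 + b1 * xi for xi in x]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Step c: weights (14.49b) and new response variable (14.49a).
        w = [p * (1.0 - p) for p in pi]
        z = [e + (yi - p) / wi for e, yi, p, wi in zip(eta, y, pi, w)]
        # Step d: weighted regression of the new response on X.
        new_b0, new_b1 = weighted_line_fit(x, z, w)
        # Step e: stop when the coefficients show essentially no change.
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Coupon effectiveness data (Table 14.2), expanded to individual 0/1 responses.
prices = [5, 10, 15, 20, 30]
households = [200, 200, 200, 200, 200]
redeemed = [30, 55, 70, 100, 137]
x, y = [], []
for price, n, k in zip(prices, households, redeemed):
    x += [float(price)] * n
    y += [1.0] * k + [0.0] * (n - k)

b0_hat, b1_hat = irls_simple_logistic(x, y)
print(round(b0_hat, 4), round(b1_hat, 5))
```

If the sketch behaves as intended, the estimates should land close to the coefficients reported in (14.35). For several predictors, the two-equation solver would be replaced by a general weighted least squares routine.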


14.5 Inferences about Regression Parameters 

The same types of inferences are of interest in logistic regression as for linear regression 
models — inferences about the regression coefficients, estimation of mean responses, and 
predictions of new observations. 

The inference procedures that we shall present rely on large sample sizes. For large 
samples, under generally applicable conditions, maximum likelihood estimators for logistic 
regression are approximately normally distributed, with little or no bias, and with 
approximate variances and covariances that are functions of the second-order partial derivatives of 
the logarithm of the likelihood function. 

Specifically, let $\mathbf{G}$ denote the matrix of second-order partial derivatives of the 
log-likelihood function in (14.42), the derivatives being taken with regard to the parameters 
$\beta_0, \beta_1, \dots, \beta_{p-1}$: 

$$\mathbf{G}_{p \times p} = [g_{ij}] \qquad i = 0, 1, \dots, p - 1;\ j = 0, 1, \dots, p - 1 \tag{14.50}$$

where: 

$$g_{00} = \frac{\partial^2 \log_e L(\boldsymbol{\beta})}{\partial \beta_0^2}$$

$$g_{01} = \frac{\partial^2 \log_e L(\boldsymbol{\beta})}{\partial \beta_0\, \partial \beta_1}$$

etc. 



This matrix is called the Hessian matrix. When the second-order partial derivatives in the 
Hessian matrix are evaluated at $\boldsymbol{\beta} = \mathbf{b}$, that is, at the maximum likelihood estimates, the 
estimated approximate variance-covariance matrix of the estimated regression coefficients 
for logistic regression can be obtained as follows: 

$$\mathbf{s}^2\{\mathbf{b}\} = \left([-g_{ij}]_{\boldsymbol{\beta} = \mathbf{b}}\right)^{-1} \tag{14.51}$$

The estimated approximate variances and covariances in (14.51) are routinely provided by 
most logistic regression computer packages. 
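
As a sketch of (14.51) for a model with one predictor, the code below builds the negative Hessian and inverts it. It relies on a standard result that is not derived in the text: differentiating (14.42) twice gives $-g_{jk} = \sum_i X_{ij} X_{ik} \pi_i (1 - \pi_i)$, so the negative Hessian is a weighted sum of cross-products. The illustration uses the coupon effectiveness data of Table 14.2 and the estimates in (14.35); the function name and the grouping by X level are our own choices.

```python
import math

def approx_covariance_simple_logistic(b0, b1, x_levels, n_per_level):
    """Invert the 2 x 2 negative Hessian of the log-likelihood at (b0, b1).

    Returns [[var(b0), cov(b0, b1)], [cov(b0, b1), var(b1)]].
    """
    h00 = h01 = h11 = 0.0
    for x, n in zip(x_levels, n_per_level):
        pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = n * pi * (1.0 - pi)  # n identical cases share the same weight pi(1 - pi)
        h00 += w                 # -g_00
        h01 += w * x             # -g_01
        h11 += w * x * x         # -g_11
    det = h00 * h11 - h01 ** 2
    return [[h11 / det, -h01 / det], [-h01 / det, h00 / det]]

# Coupon effectiveness example: X levels and households per level (Table 14.2),
# evaluated at the estimates reported in (14.35).
cov = approx_covariance_simple_logistic(-2.04435, 0.096834,
                                        [5, 10, 15, 20, 30], [200] * 5)
se_b0 = math.sqrt(cov[0][0])
se_b1 = math.sqrt(cov[1][1])
print(cov)
print(se_b0, se_b1)
```

The square roots of the diagonal elements are the estimated approximate standard deviations $s\{b_k\}$ used in the inference procedures of this section.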

Inferences about the regression coefficients for the simple logistic regression model 
(14.20) or the multiple logistic regression model (14.41) are based on the following 
approximate result when the sample size is large: 

$$\frac{b_k - \beta_k}{s\{b_k\}} \sim z \tag{14.52}$$

where $z$ is a standard normal random variable and $s\{b_k\}$ is the estimated approximate 
standard deviation of $b_k$ obtained from (14.51). 


Test Concerning a Single $\beta_k$: Wald Test 

A large-sample test of a single regression parameter can be constructed based on (14.52). 
For the alternatives: 

$$H_0\colon \beta_k = 0$$
$$H_a\colon \beta_k \neq 0 \tag{14.53a}$$

an appropriate test statistic is: 

$$z^* = \frac{b_k}{s\{b_k\}} \tag{14.53b}$$

and the decision rule is: 

If $|z^*| \le z(1 - \alpha/2)$, conclude $H_0$ 
If $|z^*| > z(1 - \alpha/2)$, conclude $H_a$     (14.53c) 

One-sided alternatives will involve a one-sided decision rule. The testing procedure in 
(14.53) is commonly referred to as the Wald test. On occasion, the square of $z^*$ is used 
instead, and the test is then based on a chi-square distribution with 1 degree of freedom. 
This is also referred to as the Wald test. 


Example

In the programming task example, $\beta_1$ was expected to be positive. The alternatives of
interest therefore are:

$$H_0\colon \beta_1 \le 0 \qquad H_a\colon \beta_1 > 0$$

Test statistic (14.53b), using the results in Table 14.1b, is:

$$z^* = \frac{.1615}{.0650} = 2.485$$

For $\alpha = .05$, we require $z(.95) = 1.645$. The decision rule therefore is:

$$\begin{aligned} &\text{If } z^* \le 1.645, \text{ conclude } H_0 \\ &\text{If } z^* > 1.645, \text{ conclude } H_a \end{aligned}$$

Since $z^* = 2.485 > 1.645$, we conclude $H_a$, that $\beta_1$ is positive, as expected. The
one-sided P-value of this test is .0065.
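
The Wald test arithmetic for the programming task example can be checked with a short sketch that uses only Python's standard library. The function name `wald_test` and its `alternative` argument are hypothetical conveniences introduced here; only the inputs $b_1 = .1615$ and $s\{b_1\} = .0650$ come from the text.

```python
from statistics import NormalDist

def wald_test(b_k, se_k, alpha=0.05, alternative="two-sided"):
    """Large-sample Wald test of beta_k = 0 based on z* = b_k / s{b_k}."""
    z_star = b_k / se_k
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative == "greater":
        critical = std_normal.inv_cdf(1 - alpha)
        p_value = 1 - std_normal.cdf(z_star)
        conclude_ha = z_star > critical
    elif alternative == "two-sided":
        critical = std_normal.inv_cdf(1 - alpha / 2)
        p_value = 2 * (1 - std_normal.cdf(abs(z_star)))
        conclude_ha = abs(z_star) > critical
    else:
        raise ValueError("alternative must be 'greater' or 'two-sided'")
    return z_star, critical, p_value, conclude_ha

# One-sided test of H0: beta_1 <= 0 versus Ha: beta_1 > 0.
z_star, critical, p_value, conclude_ha = wald_test(0.1615, 0.0650, alternative="greater")
print(round(z_star, 3), round(critical, 3), round(p_value, 4), conclude_ha)
```

Squaring `z_star` gives the chi-square form of the Wald statistic mentioned above; for a two-sided alternative the two forms lead to the same P-value.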

Interval Estimation of a Single $\beta_k$

From (14.52), we obtain directly the approximate $1 - \alpha$ confidence limits for $\beta_k$:

$$b_k \pm z(1 - \alpha/2)\, s\{b_k\} \tag{14.54}$$

where $z(1 - \alpha/2)$ is the $(1 - \alpha/2)100$ percentile of the standard normal distribution.
The corresponding confidence limits for the odds ratio $\exp(\beta_k)$ are:

$$\exp[b_k \pm z(1 - \alpha/2)\, s\{b_k\}] \tag{14.55}$$

For the programming task example, it is desired to estimate $\beta_1$ with an approximate
95 percent confidence interval. We require $z(.975) = 1.960$, as well as the estimates
$b_1 = .1615$ and $s\{b_1\} = .0650$, which are given in Table 14.1b. Hence, the confidence limits
are $.1615 \pm 1.960(.0650)$, and the approximate 95 percent confidence interval for $\beta_1$ is:

$$.0341 \le \beta_1 \le .2889$$

Thus, we can conclude with approximately 95 percent confidence that $\beta_1$ is between
.0341 and .2889. The corresponding 95 percent confidence limits for the odds ratio are
$\exp(.0341) = 1.03$ and $\exp(.2889) = 1.33$.

To examine whether the large-sample inference procedures are applicable here when
$n = 25$, bootstrap sampling can be employed, as described in Chapter 13. Alternatively,
estimation procedures have been developed for logistic regression that do not depend on any
large-sample approximations. LogXact (Reference 14.3) was run on the data and produced
95 percent confidence limits for $\beta_1$ of .041 and .296. The large-sample limits of .034 and
.289 are reasonably close to the LogXact limits, confirming the applicability of large-sample
theory here.

If we wish to consider the odds ratio for persons whose experience differs by, say, five
months, the point estimate of this odds ratio would be $\exp(5b_1) = \exp[5(.1615)] = 2.242$,
and the 95 percent confidence limits would be obtained from the confidence limits for $\beta_1$
as follows: $\exp[5(.0341)] = 1.186$ and $\exp[5(.2889)] = 4.240$. Thus, with 95 percent
confidence we estimate that the odds of success increase by between 19 percent and 324 percent
with an additional five months of experience.
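
The interval calculations for the programming task example can be reproduced with the sketch below. The function names and the `units` argument (the difference in the predictor for which the odds ratio is wanted) are hypothetical; the numerical inputs are the values $b_1 = .1615$ and $s\{b_1\} = .0650$ quoted from Table 14.1b.

```python
import math
from statistics import NormalDist

def coefficient_interval(b_k, se_k, confidence=0.95):
    """Approximate confidence limits b_k +/- z(1 - alpha/2) s{b_k}, as in (14.54)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return b_k - z * se_k, b_k + z * se_k

def odds_ratio_interval(b_k, se_k, confidence=0.95, units=1.0):
    """Point estimate and limits for the odds ratio over a difference of `units` in X."""
    lower, upper = coefficient_interval(b_k, se_k, confidence)
    return math.exp(units * b_k), math.exp(units * lower), math.exp(units * upper)

lower, upper = coefficient_interval(0.1615, 0.0650)
or_1 = odds_ratio_interval(0.1615, 0.0650)           # one additional month
or_5 = odds_ratio_interval(0.1615, 0.0650, units=5)  # five additional months
print([round(v, 4) for v in (lower, upper)])
print([round(v, 3) for v in or_1])
print([round(v, 3) for v in or_5])
```

Small differences in the last digit relative to the text can arise because the text rounds the limits for $\beta_1$ to four decimals before exponentiating.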

Comments

1. If the large-sample conditions for inferences are not met, the bootstrap procedure can be
employed to obtain confidence limits for the regression coefficients. The bootstrap here
requires generating Bernoulli random variables as discussed in Section 14.8 for the
construction of simulated envelopes.

2. We are using the z approximation here for large-sample inferences rather than the t
approximation used in Chapter 13 for nonlinear regression. This choice is conventional for
logistic regression. For large sample sizes, there is little difference between the t
distribution and the standard normal distribution.

3. Approximate joint confidence intervals for several logistic regression parameters can be
developed by the Bonferroni procedure. If $g$ parameters are to be estimated with family
confidence coefficient of approximately $1 - \alpha$, the joint Bonferroni confidence limits are:

$$b_k \pm B\, s\{b_k\} \tag{14.56}$$

where:

$$B = z(1 - \alpha/2g) \tag{14.56a}$$


4. For power -and sample size considerations in logistic regression modeling, see Reference 14 ^ 

■ 

Test Whether Several $\beta_k = 0$: Likelihood Ratio Test

Frequently there is interest in determining whether a subset of the X variables in a multiple
logistic regression model can be dropped, that is, in testing whether the associated regression
coefficients equal zero. The test procedure we shall employ is a general one for use with
maximum likelihood estimation, and is analogous to the general linear test procedure for
linear models. The test is called the likelihood ratio test, and, like the general linear test, is
based on a comparison of full and reduced models. The test is valid for large sample sizes.
We begin with the full logistic model with response function:

$$\pi = [1 + \exp(-\mathbf{X}'\boldsymbol{\beta}_F)]^{-1} \qquad \text{Full model} \tag{14.57}$$

where:

$$\mathbf{X}'\boldsymbol{\beta}_F = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}$$

We then find the maximum likelihood estimates for the full model, now denoted by $\mathbf{b}_F$,
and evaluate the likelihood function $L(\boldsymbol{\beta})$ when $\boldsymbol{\beta}_F = \mathbf{b}_F$. We shall denote this value
of the likelihood function for the full model by $L(F)$.

The hypothesis we wish to test is:

$$\begin{aligned} &H_0\colon \beta_q = \beta_{q+1} = \cdots = \beta_{p-1} = 0 \\ &H_a\colon \text{not all of the } \beta_k \text{ in } H_0 \text{ equal zero} \end{aligned} \tag{14.58}$$

where, for convenience, we arrange the model so that the last $p - q$ coefficients are those
tested. The reduced logistic model therefore has the response function:

$$\pi = [1 + \exp(-\mathbf{X}'\boldsymbol{\beta}_R)]^{-1} \qquad \text{Reduced model} \tag{14.59}$$

where:

$$\mathbf{X}'\boldsymbol{\beta}_R = \beta_0 + \beta_1 X_1 + \cdots + \beta_{q-1} X_{q-1}$$

Now we obtain the maximum likelihood estimates $\mathbf{b}_R$ for the reduced model and evaluate
the likelihood function for the reduced model containing $q$ parameters when $\boldsymbol{\beta}_R = \mathbf{b}_R$.
We shall denote this value of the likelihood function for the reduced model by $L(R)$. It can
be shown that $L(R)$ cannot exceed $L(F)$ since one cannot obtain a larger maximum for the
likelihood function using a subset of the parameters.



The actual test statistic for the likelihood ratio test, denoted by $G^2$, is:

$$G^2 = -2\log_e\left[\frac{L(R)}{L(F)}\right] = -2[\log_e L(R) - \log_e L(F)] \tag{14.60}$$

Note that if the ratio $L(R)/L(F)$ is small, indicating $H_a$ is the appropriate conclusion, then
$G^2$ is large. Thus, large values of $G^2$ lead to conclusion $H_a$.

Large-sample theory states that when $n$ is large, $G^2$ is distributed approximately as
$\chi^2(p - q)$ when $H_0$ in (14.58) holds. The degrees of freedom correspond to
$df_R - df_F = (n - q) - (n - p) = p - q$. The appropriate decision rule therefore is:

$$\begin{aligned} &\text{If } G^2 \le \chi^2(1 - \alpha;\, p - q), \text{ conclude } H_0 \\ &\text{If } G^2 > \chi^2(1 - \alpha;\, p - q), \text{ conclude } H_a \end{aligned} \tag{14.61}$$


In the disease outbreak example, the model building began with the three predictor variables
that were considered a priori to be key explanatory variables: age, socioeconomic status,
and city sector. A logistic regression model was fitted containing these three predictor
variables and the log-likelihood for this model was obtained. Then tests were conducted to
see whether a variable could be dropped from the model. First, age was dropped from
the logistic model and the log-likelihood for this reduced model was obtained. The results
were:

$$\log_e L(F) = \log_e L(b_0, b_1, b_2, b_3, b_4) = -50.527 \qquad \log_e L(R) = \log_e L(b_0, b_2, b_3, b_4) = -53.102$$

Hence the required test statistic is:

$$G^2 = -2[\log_e L(R) - \log_e L(F)] = -2[-53.102 - (-50.527)] = 5.150$$

For $\alpha = .05$, we require $\chi^2(.95; 1) = 3.84$. Hence to test $H_0\colon \beta_1 = 0$, $H_a\colon \beta_1 \neq 0$,
the appropriate decision rule is:

$$\begin{aligned} &\text{If } G^2 \le 3.84, \text{ conclude } H_0 \\ &\text{If } G^2 > 3.84, \text{ conclude } H_a \end{aligned}$$

Since $G^2 = 5.15 > 3.84$, we conclude $H_a$, that $X_1$ should not be dropped from the model.
The P-value of this test is .023.

Similar tests for socioeconomic status ($X_2$, $X_3$) and city sector ($X_4$) led to P-values of
.55 and .001. The P-value for socioeconomic status suggests that it can be dropped from the
model containing the other two predictor variables. However, since this variable was
considered a priori to be important, additional analyses were conducted. When socioeconomic
status is the only predictor in the logistic regression model, the P-value for the test whether
this predictor variable is helpful is .16, suggesting marginal importance for this variable.
In addition, the estimated regression coefficients for age and city sector and their estimated
standard deviations are not appreciably affected by whether or not socioeconomic status is
in the regression model. Hence, it was decided to keep socioeconomic status in the logistic
regression model in view of its a priori importance.

The next question of concern was whether any two-factor interaction terms are required
in the model. The full model now includes all possible two-factor interactions, in addition
to the main effects, so that $\mathbf{X}'\boldsymbol{\beta}_F$ for this model is as follows:

$$\mathbf{X}'\boldsymbol{\beta}_F = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_1 X_2 + \beta_6 X_1 X_3 + \beta_7 X_1 X_4 + \beta_8 X_2 X_4 + \beta_9 X_3 X_4 \qquad \text{Full model}$$

We wish to test:

$$\begin{aligned} &H_0\colon \beta_5 = \beta_6 = \cdots = \beta_9 = 0 \\ &H_a\colon \text{not all } \beta_k \text{ in } H_0 \text{ equal zero} \end{aligned}$$

so that $\mathbf{X}'\boldsymbol{\beta}_R$ for the reduced model is:

$$\mathbf{X}'\boldsymbol{\beta}_R = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 \qquad \text{Reduced model}$$

A computer run of a multiple logistic regression package yielded:

$$\log_e L(F) = -46.998 \qquad \log_e L(R) = -50.527$$

$$G^2 = -2[\log_e L(R) - \log_e L(F)] = 7.058$$

If $H_0$ holds, $G^2$ follows approximately the chi-square distribution with 5 degrees of freedom.
For $\alpha = .05$, we require $\chi^2(.95; 5) = 11.07$. Since $G^2 = 7.058 \le 11.07$, we conclude $H_0$,
that the two-factor interactions are not needed in the logistic regression model. The P-value
of this test is .22. We note again that a logistic regression model without interaction terms
is desirable, because otherwise $\exp(\beta_k)$ no longer can be interpreted as the odds ratio.

Thus, the fitted logistic regression model (14.46) was accepted as the model to be checked
diagnostically and, finally, to be validated.
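
Both likelihood ratio tests in the disease outbreak example (dropping age, and dropping the five interaction terms) can be checked numerically. The sketch below is one way to do so without external libraries: `chi2_upper_tail` is a hypothetical helper that evaluates the chi-square upper-tail probability for integer degrees of freedom from the well-known finite-sum closed forms, and `likelihood_ratio_test` is a hypothetical name for the computation in (14.60). Only the log-likelihood values are taken from the text.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with integer df exceeds x), using finite-sum closed forms."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2.0) / j
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) exp(-x/2) * sum_j x^(j-1/2) / (1*3*...*(2j-1))
    term, total = math.sqrt(2.0 * x / math.pi), 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return math.erfc(math.sqrt(x / 2.0)) + math.exp(-x / 2.0) * total

def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """G^2 = -2[log L(R) - log L(F)] and its approximate chi-square P-value."""
    g2 = -2.0 * (loglik_reduced - loglik_full)
    return g2, chi2_upper_tail(g2, df)

# Dropping age (1 degree of freedom).
g2_age, p_age = likelihood_ratio_test(-53.102, -50.527, df=1)
# Dropping the five two-factor interaction terms (5 degrees of freedom).
g2_int, p_int = likelihood_ratio_test(-50.527, -46.998, df=5)
print(round(g2_age, 3), round(p_age, 3))
print(round(g2_int, 3), round(p_int, 2))
```

In practice a statistical library's chi-square distribution function would normally be used instead of the hand-written helper; the closed forms are used here only to keep the sketch self-contained.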

Comment

The Wald test for a single regression parameter in (14.53) is more versatile than the likelihood
ratio test in (14.60). The latter can only be used to test $H_0\colon \beta_k = 0$, whereas the former can
be used also for one-sided tests and for testing whether $\beta_k$ equals some specified value other
than zero. When testing $H_0\colon \beta_k = 0$, the two tests are not identical and may occasionally lead
to different conclusions. For example, the Wald test P-value for dropping age when socioeconomic
status and sector are in the model for the disease data set example is .0275; the P-value for the
likelihood ratio test is .023.

14.6 Automatic Model Selection Methods

Several automatic model selection methods are available for building logistic regression
models. These include all-possible-regressions and stepwise procedures. We begin with a
discussion of criteria for model selection.

Model Selection Criteria

In the context of multiple linear regression models, we discussed the use of the following
model selection criteria in Chapter 9: $R^2_p$, $R^2_{a,p}$, $C_p$, $AIC_p$, $SBC_p$, and $PRESS_p$. For logistic
regression modeling, the $AIC_p$ and $SBC_p$ criteria are easily adapted and are generally
available in commercial software. For these reasons we will focus on the use of these two
criteria. The modifications are as follows:

$$AIC_p = -2\log_e L(\mathbf{b}) + 2p \tag{14.62}$$

$$SBC_p = -2\log_e L(\mathbf{b}) + p\log_e(n) \tag{14.63}$$

where $\log_e L(\mathbf{b})$ is the log-likelihood expression in (14.42). Promising models will yield
relatively small values for these criteria. A third criterion that is frequently provided by
software packages is $-2$ times the log-likelihood, or $-2\log_e L(\mathbf{b})$. For this criterion, we
also seek models giving small values. A drawback of this third criterion is that $-2\log_e L(\mathbf{b})$
will never increase as terms are added to the model, because there is no penalty for adding
predictors. This is analogous to the use of $SSE_p$ or $R^2_p$ in multiple linear regression. It is
easily seen from (14.62) and (14.63) that $AIC_p$ and $SBC_p$ also involve $-2\log_e L(\mathbf{b})$, but
penalties are added based on the number of terms $p$. This penalty is $2p$ for $AIC_p$ and
$p\log_e(n)$ for $SBC_p$.

Best Subsets Procedures

"Best" subsets procedures were discussed in Section 9.4 in the context of multiple linear
regression. Recall that these procedures identify a group of subset models that give the
best values of a specified criterion. As long as the number of parameters is not too large
(typically less than 30 or 40) these procedures can be useful. As we noted in Section 9.4,
time-saving algorithms have been developed that can identify the most promising models,
without having to evaluate all $2^{p-1}$ candidates. These procedures are similarly applicable in
the context of logistic regression. We now illustrate the use of the best subsets procedure
based on the $AIC_p$ and $SBC_p$ criteria.

For the disease outbreak example, there are four predictors: age ($X_1$), socioeconomic
status ($X_2$ and $X_3$), and city sector ($X_4$). Normally, it is advantageous to tie the two
indicators for the qualitative predictor socioeconomic status together; that is, a model should
either have both predictors, or neither. Since very few statistical software packages follow
this convention, we will allow them to be independently included. This leads to the
$2^4 = 16$ possible regression models listed in columns 2-5 of Table 14.6a. The $AIC_p$, $SBC_p$,
and $-2\log_e L(\mathbf{b})$ criterion values for each of the 16 models are listed in columns 6-8 of
Table 14.6a and are plotted against $p$ in Figures 14.10a-c, respectively.

As shown in Figures 14.10a and 14.10b, both $AIC_p$ and $SBC_p$ are minimized for $p = 3$.
Inspection of Table 14.6b reveals that the best two-predictor model for both criteria is based
on $X_1$ (age) and $X_4$ (city sector). Other models that appear promising on the basis of the
$AIC_p$ criterion are the three-predictor subsets based on $X_1$, $X_2$, and $X_4$ and $X_1$, $X_3$, and $X_4$,
and the full model based on all four predictors. $SBC_p$ also identifies the two three-predictor
subset models just noted, as well as the one-predictor model based on $X_4$. The tendency of
$SBC_p$ to favor smaller models is evident in this example.

The plot of $-2\log_e L(\mathbf{b})$ in Figure 14.10c also points to a two- or three-predictor subset.
The additional reduction in $-2\log_e L(\mathbf{b})$ from moving from the best two-predictor model
to the best three-predictor model is small, and the returns continue to diminish as we move
from three predictors to the full, four-predictor model.
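
The criteria (14.62) and (14.63) are simple to compute once $-2\log_e L(\mathbf{b})$ is known. The following sketch recomputes them for five of the candidate models, using the $-2\log_e L(\mathbf{b})$ values reported in Table 14.6a. It assumes $n = 98$ cases, the total number of persons shown for the disease outbreak data in Table 14.8; the function names and the dictionary layout are hypothetical.

```python
import math

def aic(neg2_loglik, p):
    """AIC_p = -2 log L(b) + 2p, where p counts all parameters including the intercept."""
    return neg2_loglik + 2 * p

def sbc(neg2_loglik, p, n):
    """SBC_p = -2 log L(b) + p log(n)."""
    return neg2_loglik + p * math.log(n)

n = 98  # assumed number of cases in the disease outbreak data set
# model predictors -> (-2 log L(b), number of parameters p), from Table 14.6a
candidates = {
    ("X4",): (107.534, 2),
    ("X1", "X4"): (102.259, 3),
    ("X1", "X2", "X4"): (101.310, 4),
    ("X1", "X3", "X4"): (101.521, 4),
    ("X1", "X2", "X3", "X4"): (101.054, 5),
}

by_aic = sorted(candidates, key=lambda m: aic(*candidates[m]))
by_sbc = sorted(candidates, key=lambda m: sbc(*candidates[m], n))
for model in by_aic:
    neg2, p = candidates[model]
    print(model, round(aic(neg2, p), 3), round(sbc(neg2, p, n), 3))
```

Sorting by each criterion shows the heavier $SBC_p$ penalty at work: the one-predictor model moves up to second place under $SBC_p$, while the full model drops to the bottom of this short list.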

TABLE 14.6 Best Subsets Results: Disease Outbreak Example.

(a) Results for All Possible Models ($X_i = 1$ if $X_i$ in model; $= 0$ otherwise)

          (1)         (2)     (3)     (4)      (5)       (6)        (7)         (8)
                              Socioeconomic    City
       Parameters     Age        Status        Sector
Model      p          X_1     X_2     X_3      X_4      AIC_p      SBC_p    -2 log_e L(b)
  1        1           0       0       0        0      124.318    126.903     122.318
  2        2           1       0       0        0      118.913    124.083     114.913
  3        2           0       1       0        0      124.882    130.052     120.882
  4        2           0       0       1        0      122.229    127.399     118.229
  5        2           0       0       0        1      111.534    116.704     107.534
  6        3           1       1       0        0      119.109    126.864     113.109
  7        3           1       0       1        0      117.968    125.723     111.968
  8        3           1       0       0        1      108.259    116.014     102.259
  9        3           0       1       1        0      124.085    131.840     118.085
 10        3           0       1       0        1      112.881    120.636     106.881
 11        3           0       0       1        1      112.371    120.126     106.371
 12        4           1       1       1        0      119.502    129.842     111.502
 13        4           1       1       0        1      109.310    119.650     101.310
 14        4           1       0       1        1      109.521    119.861     101.521
 15        4           0       1       1        1      114.204    124.543     106.204
 16        5           1       1       1        1      111.054    123.979     101.054

(b) Best Four Models for Each Criterion

            AIC_p Criterion                    SBC_p Criterion
Rank    Predictors            AIC_p        Predictors         SBC_p
  1     X_1, X_4              108.259      X_1, X_4           116.014
  2     X_1, X_2, X_4         109.310      X_4                116.704
  3     X_1, X_3, X_4         109.521      X_1, X_2, X_4      119.650
  4     X_1, X_2, X_3, X_4    111.054      X_1, X_3, X_4      119.861

FIGURE 14.10 Plots of $AIC_p$, $SBC_p$, and $-2\log_e L(\mathbf{b})$: Disease Outbreak Example.
(a) $AIC_p$ versus $p$ (b) $SBC_p$ versus $p$ (c) $-2\log_e L(\mathbf{b})$ versus $p$

Stepwise Model Selection

As we noted in Chapter 9 in the context of model selection for multiple linear regression,
when the number of predictors is large (i.e., 40 or more) the use of all-possible-regression
procedures for model selection may not be feasible. In such cases, stepwise selection
procedures are generally employed. The stepwise procedures discussed in Section 9.4 for multiple
linear regression are easily adapted for use in logistic regression. The only change required
concerns the decision rule for adding or deleting a predictor. For multiple linear regression,
this decision is based on the t-value associated with $b_k$, and its P-value. For logistic
regression, we obtain an analogous procedure by basing the decision on the Wald statistic
$z^*$ in (14.53b) for the $k$th estimated regression parameter, and its P-value. With this change,
implementation of the various stepwise variants, such as the forward stepwise, forward
selection, and backward elimination algorithms, is straightforward. We illustrate the use of
forward stepwise selection for the disease outbreak data.

Example

Figure 14.11 provides partial output from the SPSS forward stepwise selection procedure
for the disease outbreak example. This routine will add a predictor only if the P-value
associated with its Wald test statistic is less than 0.05. In step one, city sector ($X_4$) is
entered; its P-value is .000. In Step 2, age ($X_1$) is entered, with a P-value of 0.026. At this
point the procedure terminates, because no further predictors can be added with resulting
P-values less than 0.05. Thus, the forward stepwise selection procedure has identified the
same model favored by $AIC_p$ and $SBC_p$. Notice that SPSS also prints the square of the Wald
test statistics from (14.53b) in the column labeled "Wald." As noted earlier, when $(z^*)^2$
is used, P-values are obtained from a chi-square distribution with 1 degree of freedom.

FIGURE 14.11 Partial Output from SPSS Forward Stepwise Selection Procedure: Disease Outbreak Example.

Logistic Regression
Block 1: Method = Forward Stepwise (Wald)

Variables in the Equation

                         B        S.E.      Wald      df     Sig.     Exp(B)
Step 1 (a)  SECTOR      1.743     .473     13.593      1     .000      5.716
            Constant   -3.332     .765     18.990      1     .000       .036
Step 2 (b)  AGE          .029     .013      4.946      1     .026      1.030
            SECTOR      1.673     .487     11.791      1     .001      5.331
            Constant   -4.009     .873     21.060      1     .000       .018

a. Variable(s) entered on step 1: SECTOR.
b. Variable(s) entered on step 2: AGE.
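
The relationship between the "Wald" and "Sig." columns of the SPSS output can be illustrated with a short sketch. For a chi-square variable with 1 degree of freedom, the upper-tail probability at $w$ equals $P(|z| > \sqrt{w})$, which the standard library exposes through the complementary error function. The function name is hypothetical, and the Wald values are those printed in Figure 14.11.

```python
import math

def wald_chi_square_p_value(wald):
    """Upper-tail P-value of a chi-square(1) statistic: P(|z| > sqrt(wald))."""
    return math.erfc(math.sqrt(wald / 2.0))

# Squared Wald statistics as printed in the "Wald" column of Figure 14.11.
wald_values = {
    "Step 1 SECTOR": 13.593,
    "Step 2 AGE": 4.946,
    "Step 2 SECTOR": 11.791,
}
p_values = {name: wald_chi_square_p_value(w) for name, w in wald_values.items()}
for name, p in p_values.items():
    print(name, round(p, 3))
```

Note that squaring the rounded ratio B/S.E. from the printed output (for example, $(.029/.013)^2$) will not reproduce the "Wald" column exactly, presumably because the package computes it from unrounded estimates.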




14.7 Tests for Goodness of Fit

The appropriateness of the fitted logistic regression model needs to be examined before it is
accepted for use, as is the case for all regression models. In particular, we need to examine
whether the estimated response function for the data is monotonic and sigmoidal in shape,
key properties of the logistic response function. Goodness of fit tests provide an overall
measure of the fit of the model, and are usually not sensitive when the fit is poor for just a
few cases. Logistic regression diagnostics, which focus on individual cases, will be taken
up in the next section.

Before discussing several goodness of fit tests, it is necessary to again distinguish between
replicated and unreplicated binary data. In Sections 3.7 and 6.8, we discussed the F test for
lack-of-fit for the simple and multiple linear regression models. For simple linear regression,
the lack-of-fit test requires repeat observations at one or more levels of the single predictor
X, and, for multiple regression, there must be multiple or repeat observations that have the
same values for all of the predictors. This requirement also holds true for two of the goodness
of fit tests that we will present for logistic regression, namely, the Pearson chi-square and
the deviance goodness of fit tests. Then we present the Hosmer-Lemeshow test that is useful
for unreplicated data sets or for data sets containing just a few replicated observations.


Pearson Chi-Square Goodness of Fit Test

The Pearson chi-square goodness of fit test assumes only that the $Y_{ij}$ observations are
independent and that replicated data of reasonable sample size are available. The test can
detect major departures from a logistic response function, but is not sensitive to small
departures from a logistic response function. The alternatives of interest are:

$$\begin{aligned} &H_0\colon E\{Y\} = [1 + \exp(-\mathbf{X}'\boldsymbol{\beta})]^{-1} \\ &H_a\colon E\{Y\} \neq [1 + \exp(-\mathbf{X}'\boldsymbol{\beta})]^{-1} \end{aligned} \tag{14.64}$$

As was the case with tests for lack-of-fit in simple and multiple linear regression, we
shall denote the number of distinct combinations of the predictor variables by $c$, the $i$th
binary response at predictor combination $\mathbf{X}_j$ by $Y_{ij}$, and the number of cases in the $j$th class
($j = 1, \ldots, c$) will be denoted by $n_j$. Recall from (14.32a) that:

$$Y_{.j} = \sum_{i=1}^{n_j} Y_{ij} \tag{14.65}$$

The number of cases in the $j$th class with outcome 1 will be denoted $O_{j1}$ and the number
of cases in the $j$th class with outcome 0 will be denoted by $O_{j0}$. Because the response
variable $Y_{ij}$ is a Bernoulli variable whose outcomes are 1 and 0, the numbers of cases $O_{j1}$
and $O_{j0}$ are given as follows:

$$O_{j1} = \sum_{i=1}^{n_j} Y_{ij} = Y_{.j} \tag{14.66a}$$

$$O_{j0} = \sum_{i=1}^{n_j} (1 - Y_{ij}) = n_j - Y_{.j} = n_j - O_{j1} \tag{14.66b}$$

for $j = 1, \ldots, c$.

If the logistic response function is appropriate, the expected value of $Y_{ij}$ is given by:

$$E\{Y_{ij}\} = \pi_j = [1 + \exp(-\mathbf{X}_j'\boldsymbol{\beta})]^{-1} \tag{14.67}$$

and is estimated by the fitted value $\hat{\pi}_j$:

$$\hat{\pi}_j = [1 + \exp(-\mathbf{X}_j'\mathbf{b})]^{-1} \tag{14.68}$$

Consequently, if the logistic response function is appropriate, the expected numbers of cases
with $Y_{ij} = 1$ and $Y_{ij} = 0$ for the $j$th class are estimated to be:

$$E_{j1} = n_j \hat{\pi}_j \tag{14.69a}$$

$$E_{j0} = n_j(1 - \hat{\pi}_j) = n_j - E_{j1} \tag{14.69b}$$

where $E_{j1}$ denotes the estimated expected number of 1s in the $j$th class, and $E_{j0}$ denotes
the estimated expected number of 0s in the $j$th class.

The test statistic is the usual chi-square goodness of fit test statistic:

$$X^2 = \sum_{j=1}^{c} \sum_{k=0}^{1} \frac{(O_{jk} - E_{jk})^2}{E_{jk}} \tag{14.70}$$

If the logistic response function is appropriate, $X^2$ follows approximately a $\chi^2$ distribution
with $c - p$ degrees of freedom when $n_j$ is large and $p < c$. As with other chi-square
goodness of fit tests, it is advisable that most expected frequencies $E_{jk}$ be moderately large,
say 5 or greater, and none smaller than 1.

Large values of the test statistic $X^2$ indicate that the logistic response function is not
appropriate. The decision rule for testing the alternatives in (14.64), when controlling the
level of significance at $\alpha$, therefore is:

$$\begin{aligned} &\text{If } X^2 \le \chi^2(1 - \alpha;\, c - p), \text{ conclude } H_0 \\ &\text{If } X^2 > \chi^2(1 - \alpha;\, c - p), \text{ conclude } H_a \end{aligned} \tag{14.71}$$

Example

For the coupon effectiveness example, we have five classes. Table 14.7 provides for each
class $j$: $n_j$, the number of binary outcomes; $\hat{\pi}_j$, the model-based estimate of $\pi_j$; $p_j$, the
observed proportion of 1s; $O_{j0}$ and $O_{j1}$, the number of cases with $Y_{ij} = 0$ and $Y_{ij} = 1$
for each class; and finally, the estimated expected frequencies $E_{j0}$ and $E_{j1}$, if the logistic
regression model (14.35) is appropriate (calculations not shown).

TABLE 14.7 Coupon Effectiveness Example.

                                    Number of Coupons         Number of Coupons
                                      Not Redeemed                Redeemed
Class                              Observed   Expected       Observed   Expected
  j      n_j    pi-hat_j    p_j      O_j0       E_j0           O_j1       E_j1
  1      200     .1736     .150      170        165.3           30         34.7
  2      200     .2543     .275      145        149.1           55         50.9
  3      200     .3562     .350      130        128.8           70         71.2
  4      200     .4731     .500      100        105.4          100         94.6
  5      200     .7028     .685       63         59.4          137        140.6

Test statistic (14.70) is calculated as follows:

$$X^2 = \frac{(170 - 165.3)^2}{165.3} + \frac{(30 - 34.7)^2}{34.7} + \cdots + \frac{(137 - 140.6)^2}{140.6} = 2.15$$

For $\alpha = .05$ and $c - p = 5 - 2 = 3$, we require $\chi^2(.95; 3) = 7.81$. Since $X^2 = 2.15 \le 7.81$,
we conclude $H_0$, that the logistic response function is appropriate. The P-value of the test
is .54.
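
The Pearson statistic for the coupon effectiveness example can be recomputed from the class sizes, fitted probabilities, and observed counts in Table 14.7. In the sketch below, the function name is hypothetical and the expected frequencies are formed as $E_{j1} = n_j\hat{\pi}_j$ and $E_{j0} = n_j - E_{j1}$, following (14.69a) and (14.69b), rather than read from the rounded table entries.

```python
def pearson_chi_square(n, pi_hat, observed_ones):
    """X^2 = sum over classes j and outcomes k of (O_jk - E_jk)^2 / E_jk, as in (14.70)."""
    x2 = 0.0
    for n_j, pi_j, o_j1 in zip(n, pi_hat, observed_ones):
        e_j1 = n_j * pi_j      # estimated expected number of 1s
        e_j0 = n_j - e_j1      # estimated expected number of 0s
        o_j0 = n_j - o_j1      # observed number of 0s
        x2 += (o_j0 - e_j0) ** 2 / e_j0 + (o_j1 - e_j1) ** 2 / e_j1
    return x2

# Values from Table 14.7 (coupon effectiveness example).
n = [200, 200, 200, 200, 200]
pi_hat = [0.1736, 0.2543, 0.3562, 0.4731, 0.7028]
observed_ones = [30, 55, 70, 100, 137]

x2 = pearson_chi_square(n, pi_hat, observed_ones)
critical_value = 7.81  # chi-square(.95; 3), as quoted in the text
print(round(x2, 2), "conclude H0" if x2 <= critical_value else "conclude Ha")
```

Using the expected frequencies rounded to one decimal, as printed in the table, gives a value that differs slightly in the second decimal place; the conclusion is unaffected.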

Deviance Goodness of Fit Test

The deviance goodness of fit test for logistic regression models is completely analogous
to the F test for lack of fit for simple and multiple linear regression models. Like the
F test for lack of fit and the Pearson chi-square goodness of fit test, we assume there
are $c$ unique combinations of the predictors denoted $\mathbf{X}_1, \ldots, \mathbf{X}_c$, the number of repeat
binary observations at $\mathbf{X}_j$ is $n_j$, and the $i$th binary response at predictor combination $\mathbf{X}_j$ is
denoted $Y_{ij}$.

The lack of fit test for standard regression was based on the general linear test of the
reduced model $E\{Y_{ij}\} = \mathbf{X}_j'\boldsymbol{\beta}$ against the full model $E\{Y_{ij}\} = \mu_j$. In similar fashion, the
deviance goodness of fit test is based on a likelihood ratio test of the reduced model:

$$E\{Y_{ij}\} = [1 + \exp(-\mathbf{X}_j'\boldsymbol{\beta})]^{-1} \qquad \text{Reduced model} \tag{14.72}$$

against the full model:

$$E\{Y_{ij}\} = \pi_j \qquad j = 1, \ldots, c \qquad \text{Full model} \tag{14.73}$$

where $\pi_j$ are parameters, $j = 1, \ldots, c$. In the lack of fit test for standard regression, the
full model allowed for a unique mean for each unique combination of the predictors, $\mathbf{X}_j$.
Similarly, the full model for the deviance goodness of fit test allows for a unique probability
$\pi_j$ for each predictor combination. This full model in the logistic regression case is usually
referred to as the saturated model.

To carry out the likelihood ratio test in (14.60), we must obtain the values of the
maximized likelihoods for the full and reduced models, namely $L(F)$ and $L(R)$. $L(R)$ is obtained
by fitting the reduced model, and the maximum likelihood estimates of the $c$ parameters in
the full model are given by the sample proportions in (14.32b):

$$p_j = \frac{Y_{.j}}{n_j} \qquad j = 1, 2, \ldots, c \tag{14.74}$$

Letting $\hat{\pi}_j$ denote the reduced model estimate of $\pi_j$ at $\mathbf{X}_j$, $j = 1, \ldots, c$, it can be shown
that likelihood ratio test statistic (14.60) is given by:

$$\begin{aligned} G^2 &= -2[\log_e L(R) - \log_e L(F)] \\ &= -2 \sum_{j=1}^{c} \left[ Y_{.j} \log_e\left(\frac{\hat{\pi}_j}{p_j}\right) + (n_j - Y_{.j}) \log_e\left(\frac{1 - \hat{\pi}_j}{1 - p_j}\right) \right] \\ &= DEV(X_0, X_1, \ldots, X_{p-1}) \end{aligned} \tag{14.75}$$

The likelihood ratio test statistic in (14.75) is called the deviance, and we use
$DEV(X_0, X_1, \ldots, X_{p-1})$ to denote the deviance for a logistic regression model based on
predictors $X_0, X_1, \ldots, X_{p-1}$. The deviance measures the deviation, in terms of $-2\log_e L$,
between the saturated model and the fitted reduced logistic regression model based on
$X_0, X_1, \ldots, X_{p-1}$.

If the logistic response function is the correct response function and the sample sizes $n_j$
are large, then the deviance will follow approximately a chi-square distribution with $c - p$
degrees of freedom. Large values of the deviance indicate that the fitted logistic model is
not correct. Hence, to test the alternatives:

$$\begin{aligned} &H_0\colon E\{Y\} = [1 + \exp(-\mathbf{X}'\boldsymbol{\beta})]^{-1} \\ &H_a\colon E\{Y\} \neq [1 + \exp(-\mathbf{X}'\boldsymbol{\beta})]^{-1} \end{aligned} \tag{14.76}$$

the appropriate decision rule is:

$$\begin{aligned} &\text{If } DEV(X_0, X_1, \ldots, X_{p-1}) \le \chi^2(1 - \alpha;\, c - p), \text{ conclude } H_0 \\ &\text{If } DEV(X_0, X_1, \ldots, X_{p-1}) > \chi^2(1 - \alpha;\, c - p), \text{ conclude } H_a \end{aligned} \tag{14.77}$$

Example

For the coupon effectiveness example, we use the results in Table 14.2 to calculate the
deviance in (14.75) directly:

$$\begin{aligned} DEV(X_0, X_1) = -2\Big[ &30\log_e\left(\frac{.1736}{.150}\right) + (200 - 30)\log_e\left(\frac{.8264}{.850}\right) + \cdots \\ &+ 137\log_e\left(\frac{.7028}{.685}\right) + (200 - 137)\log_e\left(\frac{.2972}{.315}\right) \Big] = 2.16 \end{aligned}$$

For $\alpha = .05$ and $c - p = 3$, we require $\chi^2(.95; 3) = 7.81$. Since $DEV(X_0, X_1) = 2.16 \le 7.81$,
we conclude that the logistic model is a satisfactory fit. The P-value of this test
is approximately .54, the same as that obtained earlier for the Pearson chi-square goodness
of fit test.
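
The deviance in (14.75) can be evaluated directly from the class sizes, the numbers of 1s, and the fitted probabilities. The sketch below does this for the coupon effectiveness values quoted in the example; the function name is hypothetical, and terms whose multiplier is zero are skipped so that classes with $p_j = 0$ or $p_j = 1$ do not require the logarithm of zero.

```python
import math

def deviance(n, pi_hat, observed_ones):
    """DEV = -2 * sum_j [Y_j log(pi_j / p_j) + (n_j - Y_j) log((1 - pi_j) / (1 - p_j))]."""
    total = 0.0
    for n_j, pi_j, y_j in zip(n, pi_hat, observed_ones):
        p_j = y_j / n_j  # sample proportion: the saturated-model estimate
        if y_j > 0:      # a term with Y_j = 0 contributes zero
            total += y_j * math.log(pi_j / p_j)
        if y_j < n_j:    # a term with Y_j = n_j contributes zero
            total += (n_j - y_j) * math.log((1.0 - pi_j) / (1.0 - p_j))
    return -2.0 * total

# Coupon effectiveness example: class sizes, fitted probabilities, numbers redeemed.
n = [200, 200, 200, 200, 200]
pi_hat = [0.1736, 0.2543, 0.3562, 0.4731, 0.7028]
observed_ones = [30, 55, 70, 100, 137]

dev = deviance(n, pi_hat, observed_ones)
critical_value = 7.81  # chi-square(.95; 3), as quoted in the text
print(round(dev, 2), "conclude H0" if dev <= critical_value else "conclude Ha")
```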


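Comment

If $p_j = 0$ for some $j$ in the first term in (14.75), then $Y_{.j} = 0$ and:

$$Y_{.j} \log_e\left(\frac{\hat{\pi}_j}{p_j}\right) = 0$$

Similarly, if $p_j = 1$ for some $j$ in the second term in (14.75), then $Y_{.j} = n_j$ and:

$$(n_j - Y_{.j}) \log_e\left(\frac{1 - \hat{\pi}_j}{1 - p_j}\right) = 0$$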

Hosmer-Lemeshow Goodness of Fit Test

Hosmer and Lemeshow (Reference 14.4) proposed, for either unreplicated data sets or
data sets with few replicates, the grouping of cases based on the values of the estimated
probabilities. Suppose there are no replicates, i.e., $n_j = 1$ for all $j$. The procedure consists
of grouping the data into classes with similar fitted values $\hat{\pi}_i$, with approximately the same
number of cases in each class. The grouping may be accomplished equivalently by using
the fitted logit values $\hat{\pi}_i' = \mathbf{X}_i'\mathbf{b}$ since the logit values $\hat{\pi}_i'$ are monotonically related to the
fitted mean responses $\hat{\pi}_i$. We shall do the grouping according to the fitted logit values $\hat{\pi}_i'$.
Use of from 5 to 10 classes is common, depending on the total number of cases. Once
the groups are formed, then the Hosmer-Lemeshow goodness of fit statistic is calculated
by using the Pearson chi-square test statistic (14.70) from the $c \times 2$ table of observed
and expected frequencies as described earlier. Hosmer and Lemeshow showed, using an
extensive simulation study, that the test statistic (14.70) is well approximated by the
chi-square distribution with $c - 2$ degrees of freedom.

Example

For the disease outbreak example, we shall use five classes. Table 14.8 shows the class
intervals for the logit fitted values $\hat{\pi}_i'$ and the number of cases $n_j$ in each class. It also gives
$O_{j0}$ and $O_{j1}$, the number of cases with $Y_i = 0$ and $Y_i = 1$ for each class. Finally, Table 14.8
contains the estimated expected frequencies $E_{j0}$ and $E_{j1}$ based on logistic regression model
(14.46) (calculations not shown).

TABLE 14.8 Hosmer-Lemeshow Goodness of Fit Test for Logistic Regression Function:
Disease Outbreak Example.

                                            Number of Persons        Number of Persons
                                             without Disease           with Disease
Class                                      Observed   Expected      Observed   Expected
  j      pi-hat' Interval          n_j       O_j0       E_j0          O_j1       E_j1
  1      -2.60 to under -2.08      20         19       18.196           1        1.804
  2      -2.08 to under -1.43      20         17       17.093           3        2.907
  3      -1.43 to under -.70       20         14       14.707           6        5.293
  4      -.70 to under .16         19          9       10.887          10        8.113
  5      .16 to under 1.70         19          8        6.297          11       12.703
         Total                     98         67       67.180          31       30.820

Test statistic (14.70) is calculated as follows:

$$X^2 = \frac{(19 - 18.196)^2}{18.196} + \frac{(1 - 1.804)^2}{1.804} + \cdots + \frac{(8 - 6.297)^2}{6.297} + \frac{(11 - 12.703)^2}{12.703} = 1.98$$

Since all of the $n_j$ are approximately 20 and only two expected frequencies are less than 5
and both are greater than 1, the chi-square test is appropriate here. For $\alpha = .05$ and $c - 2 = 3$,
we require $\chi^2(.95; 3) = 7.81$. Since $X^2 = 1.98 \le 7.81$, we conclude that the logistic
response function is appropriate. The P-value of the test is .58.
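
The grouping step of the Hosmer-Lemeshow procedure can be sketched as follows for unreplicated binary data. This is a minimal illustration under stated assumptions: cases are sorted by fitted probability (equivalent to sorting by fitted logit, since the two are monotonically related), split into classes of nearly equal size, and the expected number of 1s in each class is taken as the sum of the fitted probabilities in that class. The responses and fitted values below are hypothetical and are not the disease outbreak data.

```python
def hosmer_lemeshow(y, pi_hat, groups=5):
    """Group cases by fitted probability and compute the Pearson statistic (14.70)."""
    order = sorted(range(len(y)), key=lambda i: pi_hat[i])
    n = len(order)
    statistic = 0.0
    table = []  # one row per class: (n_j, O_j0, E_j0, O_j1, E_j1)
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        n_j = len(members)
        o_j1 = sum(y[i] for i in members)       # observed number of 1s
        e_j1 = sum(pi_hat[i] for i in members)  # estimated expected number of 1s
        o_j0, e_j0 = n_j - o_j1, n_j - e_j1
        statistic += (o_j0 - e_j0) ** 2 / e_j0 + (o_j1 - e_j1) ** 2 / e_j1
        table.append((n_j, o_j0, e_j0, o_j1, e_j1))
    return statistic, groups - 2, table  # degrees of freedom are c - 2

# Hypothetical fitted probabilities and binary responses for 30 cases.
pi_hat = [0.2 + 0.6 * i / 29 for i in range(30)]
y = [0, 0, 1, 0, 0, 0,  0, 1, 0, 0, 1, 0,  1, 0, 0, 1, 0, 1,
     1, 0, 1, 1, 0, 1,  1, 1, 0, 1, 1, 1]

statistic, df, table = hosmer_lemeshow(y, pi_hat, groups=5)
print(round(statistic, 3), df)
for row in table:
    print(row[0], row[1], round(row[2], 3), row[3], round(row[4], 3))
```

The resulting statistic would be compared with $\chi^2(1 - \alpha;\, c - 2)$; with tied fitted values or very small classes, packages may form the groups somewhat differently.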


Comment

We have noted that the Pearson chi-square and deviance goodness of fit tests are only appropriate
when there are repeat observations and when the number of replicates at each X category is
sufficiently large. Care must be taken in interpreting logistic regression output since some
packages will provide these statistics and the associated P-values whether or not sufficient
numbers of replicate observations are present.



14.8 Logistic Regression Diagnostics

In this section we take up the analysis of residuals and the identification of influential cases 
for logistic regression. We shall first introduce various residuals that have been defined for 
logistic regression and some associated plots. We then turn to the identification of influential 
observations. Throughout, we shall assume that the responses are binary; i.e., we focus on 
the ungrouped case. 


Logistic Regression Residuals

Residual analysis for logistic regression is more difficult than for linear regression models
because the responses $Y_i$ take on only the values 0 and 1. Consequently, the $i$th ordinary
residual, $e_i$, will assume one of two values:

$$e_i = \begin{cases} 1 - \hat{\pi}_i & \text{if } Y_i = 1 \\ -\hat{\pi}_i & \text{if } Y_i = 0 \end{cases} \tag{14.78}$$

The ordinary residuals will not be normally distributed and, indeed, their distribution under
the assumption that the fitted model is correct is unknown. Plots of ordinary residuals against
fitted values or predictor variables will generally be uninformative.

Pearson Residuals. The ordinary residuals can be made more comparable by dividing
them by the estimated standard error of $Y_i$, namely, $\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}$. The resulting Pearson
residuals are given by:

$$r_{P_i} = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}} \tag{14.79}$$


The Pearson residuals are directly related to Pearson chi-square goodness of fit statistic
(14.70). To see this we first expand (14.70) as follows:

$$X^2 = \sum_{j=1}^{c} \frac{(O_{j0} - E_{j0})^2}{E_{j0}} + \sum_{j=1}^{c} \frac{(O_{j1} - E_{j1})^2}{E_{j1}} \tag{14.79a}$$

For binary outcome data, we set $j = i$, $c = n$, $O_{j1} = Y_i$, $O_{j0} = 1 - Y_i$, $E_{j1} = \hat{\pi}_i$,
$E_{j0} = 1 - \hat{\pi}_i$, and (14.79a) becomes:

$$X^2 = \sum_{i=1}^{n} \frac{[(1 - Y_i) - (1 - \hat{\pi}_i)]^2}{1 - \hat{\pi}_i} + \sum_{i=1}^{n} \frac{(Y_i - \hat{\pi}_i)^2}{\hat{\pi}_i}$$

$$= \sum_{i=1}^{n} \frac{(Y_i - \hat{\pi}_i)^2}{1 - \hat{\pi}_i} + \sum_{i=1}^{n} \frac{(Y_i - \hat{\pi}_i)^2}{\hat{\pi}_i}$$

$$= \sum_{i=1}^{n} \frac{(Y_i - \hat{\pi}_i)^2}{\hat{\pi}_i(1 - \hat{\pi}_i)} \tag{14.79b}$$

Hence, we see that the sum of the squares of the Pearson residuals (14.79) is numerically
equal to the Pearson chi-square test statistic (14.79a). Therefore the square of each Pearson
residual measures the contribution of each binary response to the Pearson chi-square test
statistic. Note that test statistic (14.79b) does not follow an approximate chi-square
distribution for binary data without replicates.
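
The equality of (14.79a) and (14.79b) for binary data can be confirmed numerically. The sketch below is illustrative only: the function names are hypothetical, and the five response and fitted-value pairs are those quoted for cases 1 to 5 of the disease outbreak example elsewhere in this chapter, used here merely as sample input.

```python
import math

def pearson_residual(y: int, pi_hat: float) -> float:
    """Pearson residual (14.79) for one binary response."""
    return (y - pi_hat) / math.sqrt(pi_hat * (1.0 - pi_hat))

def chi_square_two_cell(ys, pis) -> float:
    """Statistic (14.79a) with c = n: one Y = 0 cell and one Y = 1 cell per case."""
    total = 0.0
    for y, p in zip(ys, pis):
        total += ((1 - y) - (1.0 - p)) ** 2 / (1.0 - p)  # O_j0 versus E_j0
        total += (y - p) ** 2 / p                        # O_j1 versus E_j1
    return total

ys = [0, 0, 0, 0, 1]                      # sample binary responses
pis = [0.209, 0.219, 0.106, 0.371, 0.111] # sample fitted probabilities

expanded = chi_square_two_cell(ys, pis)
sum_sq = sum(pearson_residual(y, p) ** 2 for y, p in zip(ys, pis))  # (14.79b)
print(expanded, sum_sq)  # the two values agree up to floating-point error
```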



592 Part Three Nonlinear Regression


Studentized Pearson Residuals. The Pearson residuals do not have unit variance since
no allowance has been made for the inherent variation in the fitted value $\hat{\pi}_i$. A better
procedure is to divide the ordinary residuals by their estimated standard deviation. This
value is approximated by $\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)(1 - h_{ii})}$, where $h_{ii}$ is the ith diagonal element of
the n x n estimated hat matrix for logistic regression:

$$\mathbf{H} = \hat{\mathbf{W}}^{1/2}\mathbf{X}(\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{X}'\hat{\mathbf{W}}^{1/2} \tag{14.80}$$

Here, $\hat{\mathbf{W}}$ is the n x n diagonal matrix with elements $\hat{\pi}_i(1 - \hat{\pi}_i)$, $\mathbf{X}$ is the usual n x p design
matrix (6.18b), and $\hat{\mathbf{W}}^{1/2}$ is a diagonal matrix with diagonal elements equal to the square
roots of those in $\hat{\mathbf{W}}$. The resulting studentized Pearson residuals are defined as:

$$r_{SP_i} = \frac{r_{P_i}}{\sqrt{1 - h_{ii}}} \tag{14.81}$$

Recall that for multiple linear regression, the hat matrix satisfies the matrix expression
$\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$. The hat matrix for logistic regression is developed in analogous fashion; it satisfies
approximately the expression $\hat{\boldsymbol{\pi}}' = \mathbf{H}\mathbf{Y}$, where $\hat{\boldsymbol{\pi}}'$ is the (n x 1) vector of linear predictors.
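
To make (14.80) concrete, the sketch below computes the hat matrix diagonal for the special case of a single predictor plus intercept (p = 2), where $(\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}$ has a closed form. The function name, the predictor values, and the coefficients are all hypothetical; a general design matrix would require a linear algebra library.

```python
import math

def logistic_leverages(x, b0, b1):
    """Diagonal of H = W^(1/2) X (X'WX)^(-1) X' W^(1/2) for design columns [1, x]."""
    w = []
    for xi in x:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # fitted probability
        w.append(p * (1.0 - p))                      # diagonal element of W
    a = sum(w)                                       # (X'WX)[0][0]
    b = sum(wi * xi for wi, xi in zip(w, x))         # (X'WX)[0][1]
    d = sum(wi * xi * xi for wi, xi in zip(w, x))    # (X'WX)[1][1]
    det = a * d - b * b
    # h_ii = w_i * [1, x_i] (X'WX)^(-1) [1, x_i]'
    return [wi * (d - 2.0 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]

x_demo = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]  # hypothetical predictor values
leverages = logistic_leverages(x_demo, b0=-0.5, b1=0.8) # hypothetical coefficients
print(leverages, sum(leverages))  # the leverages sum to p = 2
```

As in linear regression, the leverages sum to the number of parameters, which gives a quick sanity check on any implementation.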

Deviance Residuals. The model deviance (14.75) was obtained by carrying out the likelihood
ratio test where the reduced model is the logistic regression model and the full model
is the saturated model for grouped outcome data. For binary outcome data, we take the
number of X categories to be $c = n$, $n_j = 1$, $j = i$, $Y_j = Y_i$, $p_j = Y_j/n_j = Y_i$, and
(14.75) becomes:

$$DEV = -2\sum_{i=1}^{n}\left[Y_i \log_e\left(\frac{\hat{\pi}_i}{Y_i}\right) + (1 - Y_i)\log_e\left(\frac{1 - \hat{\pi}_i}{1 - Y_i}\right)\right]$$

$$= -2\sum_{i=1}^{n}\left[Y_i \log_e(\hat{\pi}_i) + (1 - Y_i)\log_e(1 - \hat{\pi}_i) - Y_i \log_e(Y_i) - (1 - Y_i)\log_e(1 - Y_i)\right]$$

$$= -2\sum_{i=1}^{n}\left[Y_i \log_e(\hat{\pi}_i) + (1 - Y_i)\log_e(1 - \hat{\pi}_i)\right] \tag{14.82}$$

since $Y_i \log_e(Y_i) = (1 - Y_i)\log_e(1 - Y_i) = 0$ for $Y_i = 0$ or $Y_i = 1$. Thus for binary data
the model deviance in (14.75) is:

$$DEV(X_0, \ldots, X_{p-1}) = -2\sum_{i=1}^{n}\left[Y_i \log_e(\hat{\pi}_i) + (1 - Y_i)\log_e(1 - \hat{\pi}_i)\right] \tag{14.82a}$$

The deviance residual for case i, denoted by $dev_i$, is defined as the signed square root of
the contribution of the ith case to the model deviance DEV in (14.82a):

$$dev_i = \text{sign}(Y_i - \hat{\pi}_i)\sqrt{-2\left[Y_i \log_e(\hat{\pi}_i) + (1 - Y_i)\log_e(1 - \hat{\pi}_i)\right]} \tag{14.83}$$

where the sign is positive when $Y_i > \hat{\pi}_i$ and negative when $Y_i < \hat{\pi}_i$. Thus the sum of the
squared deviance residuals equals the model deviance in (14.82a):

$$\sum_{i=1}^{n}(dev_i)^2 = DEV(X_0, X_1, \ldots, X_{p-1})$$





TABLE 14.9 Logistic Regression Residuals and Hat Matrix Diagonal Elements - Disease Outbreak Example.

        (1)     (2)       (3)       (4)       (5)       (6)      (7)
  i     Y_i    pi^_i      e_i      r_Pi     r_SPi     dev_i     h_ii
  1      0     0.209    -0.209    -0.514    -0.524    -0.685    .039
  2      0     0.219    -0.219    -0.529    -0.541    -0.703    .040
  3      0     0.106    -0.106    -0.344    -0.350    -0.473    .033
 ...    ...     ...       ...       ...       ...       ...      ...
 96      0     0.114    -0.114    -0.358    -0.363    -0.491    .025
 97      0     0.092    -0.092    -0.318    -0.322    -0.43?    .024
 98      0     0.171    -0.171    -0.455    -0.463    -0.613    .036


Therefore the square of each deviance residual measures the contribution of each binary
response to the deviance goodness of fit test statistic (14.82a). Note that test statistic (14.82a)
does not follow an approximate chi-square distribution for binary data without replicates.

Table 14.9 lists in columns 1-7, for a portion of the disease outbreak example, the
response $Y_i$, the predicted mean response $\hat{\pi}_i$, the ordinary residual $e_i$, the Pearson residual
$r_{P_i}$, the studentized Pearson residual $r_{SP_i}$, the deviance residual $dev_i$, and the hat matrix
diagonal elements $h_{ii}$. We illustrate the calculations needed to obtain these residuals for the
first case. The ordinary residual for the first case is from (14.78):

$$e_1 = Y_1 - \hat{\pi}_1 = 0 - .209 = -.209$$
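
The other residuals for the first case follow from (14.79), (14.81), and the deviance residual definition. The following is a minimal sketch with hypothetical function names; it takes $Y_1 = 0$, $\hat{\pi}_1 = .209$, and $h_{11} = .039$ from Table 14.9 and reproduces the tabled values to the three decimals shown.

```python
import math

def pearson_residual(y: int, pi_hat: float) -> float:
    """(14.79): ordinary residual divided by sqrt(pi_hat * (1 - pi_hat))."""
    return (y - pi_hat) / math.sqrt(pi_hat * (1.0 - pi_hat))

def studentized_pearson_residual(y: int, pi_hat: float, h_ii: float) -> float:
    """(14.81): Pearson residual divided by sqrt(1 - h_ii)."""
    return pearson_residual(y, pi_hat) / math.sqrt(1.0 - h_ii)

def deviance_residual(y: int, pi_hat: float) -> float:
    """Signed square root of the case's deviance contribution, for y in {0, 1}."""
    contribution = -2.0 * math.log(pi_hat if y == 1 else 1.0 - pi_hat)
    sign = 1.0 if y > pi_hat else -1.0
    return sign * math.sqrt(contribution)

y1, pi1, h11 = 0, 0.209, 0.039  # first case, Table 14.9

e1 = y1 - pi1
r_p1 = pearson_residual(y1, pi1)
r_sp1 = studentized_pearson_residual(y1, pi1, h11)
dev1 = deviance_residual(y1, pi1)
print(f"e={e1:.3f} r_P={r_p1:.3f} r_SP={r_sp1:.3f} dev={dev1:.3f}")
```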






The various residuals are plotted against the predicted mean response in Figure 14.12,
although we emphasize that such plots are not particularly informative. Consider, for example,
the ordinary residuals in Figure 14.12a. Here we see two trends of decreasing residuals
with slope equal to -1. These two linear trends result from the fact, noted above, that the
residuals take on just one of two values at a point $X_i$, $1 - \hat{\pi}_i$ or $0 - \hat{\pi}_i$. Plotting these values
against $\hat{\pi}_i$ will always result in two linear trends with slope -1. The remaining plots lead
to similar patterns.



[Figure 14.12, panel (c): $r_{SP_i}$ versus $\hat{\pi}_i$. Horizontal axes: Estimated Probability.]

Diagnostic Residual Plots 

In this section we consider two useful residual plots that provide some information about
the adequacy of the logistic regression fit. Recall that in ordinary regression, residual plots
are useful for diagnosing model inadequacy, nonconstant variance, and the presence of
response outliers. In logistic regression, we generally focus only on the detection of model
inadequacy. As we discussed in Section 14.1, nonconstant variance is always present in the
logistic regression setting, and the form that it takes is known. Moreover, response outliers
in binary logistic regression are difficult to diagnose and may only be evident if all responses
in a particular region of the X space have the same response value except one or two. Thus
we focus here on model adequacy.

Residuals versus Predicted Probabilities with Lowess Smooth. If the logistic regression
model is correct, then $E\{Y_i\} = \pi_i$ and it follows asymptotically that:

$$E\{Y_i - \hat{\pi}_i\} = E\{e_i\} = 0$$

This suggests that if the model is correct, a lowess smooth of the plot of the residuals
against the estimated probability $\hat{\pi}_i$ (or against the linear predictor $\hat{\pi}_i'$) should result
approximately in a horizontal line with zero intercept. Any significant departure from this
suggests that the model may be inadequate. In practice, the lowess smooth of the ordinary
residuals, the Pearson residuals, or the studentized Pearson residuals can be employed.
(Further details regarding the plotting of logistic regression residuals can be found in
Reference 14.5.)

FIGURE 14.12 Selected Residuals Plotted against Predicted Mean Response - Disease Outbreak Example.
(a) $e_i$ versus $\hat{\pi}_i$ (b) $r_{P_i}$ versus $\hat{\pi}_i$

FIGURE 14.13 Residual Plots with Lowess Smooth - Disease Outbreak Example.
(a) $r_{SP_i}$ versus $\hat{\pi}_i$ (b) $r_{SP_i}$ versus $\hat{\pi}_i'$ (c) $dev_i$ versus $\hat{\pi}_i$ (d) $dev_i$ versus $\hat{\pi}_i'$

Shown in Figures 14.13a-d are residual plots for the disease outbreak example, each with
the suggested lowess smooth superimposed. (We used the MINITAB lowess option with
degree of smoothing equal to .7 and number of steps equal to 0 to produce these plots.) In
Figures 14.13a and 14.13b, the studentized Pearson residuals are plotted respectively against
the estimated probability and the linear predictor. Figures 14.13c and 14.13d provide similar
plots for the deviance residuals. In all cases, the lowess smooth approximates a line having
zero slope and intercept, and we conclude that no significant model inadequacy is apparent.
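
For readers without MINITAB, a lowess smooth with no robustness iterations can be sketched in a few lines. This is not MINITAB's implementation: it assumes the common formulation (local linear fits with tricube weights over the nearest fraction `frac` of the points), and the function name and demonstration data are hypothetical.

```python
import math

def lowess(x, y, frac=0.7):
    """Local linear smooth with tricube weights and no robustness iterations."""
    n = len(x)
    k = max(2, int(math.ceil(frac * n)))         # number of neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12                # bandwidth: distance to kth nearest point
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi in zip(x, y):
            u = abs(xi - x0) / h
            w = (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0   # tricube weight
            sw += w
            swx += w * xi
            swy += w * yi
            swxx += w * xi * xi
            swxy += w * xi * yi
        det = sw * swxx - swx * swx
        if abs(det) < 1e-12:                     # degenerate neighbourhood: weighted mean
            fitted.append(swy / sw)
        else:                                    # weighted least squares line evaluated at x0
            slope = (sw * swxy - swx * swy) / det
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted

x_demo = [0.1 * i for i in range(20)]            # hypothetical plotting positions
y_line = [2.0 * xi + 1.0 for xi in x_demo]       # exactly linear input
smooth = lowess(x_demo, y_line, frac=0.7)        # a local linear smooth reproduces a line
```

In a diagnostic plot, `x` would hold the estimated probabilities (or linear predictors) and `y` the residuals; a smooth that stays near zero is consistent with an adequate model.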

Half-Normal Probability Plot with Simulated Envelope. A half-normal probability plot
of the deviance residuals with a simulated envelope is useful both for examining the adequacy
of the linear part of the logistic regression model and for identifying deviance residuals that
are outlying. A half-normal probability plot helps to highlight outlying deviance residuals
even though the residuals are not normally distributed. In a normal probability plot, the kth
ordered residual is plotted against the percentile $z[(k - .375)/(n + .25)]$ or against $\sqrt{MSE}$
times this percentile, as shown in (3.6). In a half-normal probability plot, the kth ordered
absolute residual is plotted against:

$$z\left(\frac{k + n - 1/8}{2n + 1/2}\right) \tag{14.84}$$

Outliers will appear at the top right of a half-normal probability plot as points separated
from the others. However, a half-normal plot of the absolute residuals will not necessarily
give a straight line even when the fitted model is in fact correct.

To identify outlying deviance residuals, we combine a half-normal probability plot with a
simulated envelope (Reference 14.6). This envelope constitutes a band such that the plotted
residuals are all likely to fall within the band if the fitted model is correct.

A simulated envelope for a half-normal probability plot of the absolute deviance residuals
is constructed in the following way:

1. For each of the n cases, generate a Bernoulli outcome (0, 1), where the Bernoulli
parameter for case i is $\hat{\pi}_i$, the estimated probability of response $Y_i = 1$ according to the
originally fitted model.

2. Fit the logistic regression model for the n new responses where the predictor variables
keep their original values, and obtain the deviance residuals. Order the absolute deviance
residuals in ascending order.

3. Repeat the first two steps 18 times.

4. Assemble the smallest absolute deviance residuals from the 19 groups and determine
the minimum value, the mean, and the maximum value of these 19 residuals.

5. Repeat step 4 by assembling the group of second smallest absolute residuals, the group
of third smallest absolute residuals, etc.

6. Plot the minimum, mean, and maximum values for each of the n ordered residual
groups against the corresponding expected value in (14.84) on the half-normal probability
plot for the original data and connect the points by straight lines.

By using 19 simulations, there is one chance in 20, or 5 percent, that the largest absolute
deviance residual from the original data set lies outside the simulated envelope when the
fitted model is correct. Large deviations of points from the means of the simulated values,
or the occurrence of points near to or outside the simulated envelope, are indications that
the fitted model is not appropriate.
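
The six steps above can be sketched as follows for a model with one predictor. This is an illustration under stated assumptions, not the procedure used to produce the chapter's tables: the Newton-Raphson fitter, the function names, the clipping of the linear predictor, and the demonstration data are all hypothetical choices made to keep the example self-contained.

```python
import math
import random
from statistics import NormalDist

def prob(b0, b1, xi):
    eta = max(-30.0, min(30.0, b0 + b1 * xi))    # clip so exp() stays finite
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = prob(b0, b1, xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-10:                          # nearly separated sample: stop updating
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def abs_deviance_residuals(x, y, b0, b1):
    """Absolute deviance residuals in ascending order."""
    out = []
    for xi, yi in zip(x, y):
        p = prob(b0, b1, xi)
        out.append(math.sqrt(-2.0 * math.log(p if yi == 1 else 1.0 - p)))
    return sorted(out)

def simulated_envelope(x, y, n_sim=19, seed=1):
    rng = random.Random(seed)
    n = len(x)
    b0, b1 = fit_logistic(x, y)
    pi_hat = [prob(b0, b1, xi) for xi in x]
    observed = abs_deviance_residuals(x, y, b0, b1)
    sims = []
    for _ in range(n_sim):                                        # steps 1 to 3
        y_sim = [1 if rng.random() < p else 0 for p in pi_hat]
        c0, c1 = fit_logistic(x, y_sim)
        sims.append(abs_deviance_residuals(x, y_sim, c0, c1))
    lower = [min(s[k] for s in sims) for k in range(n)]           # steps 4 and 5
    mean = [sum(s[k] for s in sims) / n_sim for k in range(n)]
    upper = [max(s[k] for s in sims) for k in range(n)]
    z = [NormalDist().inv_cdf((k + n - 0.125) / (2 * n + 0.5))    # percentiles (14.84)
         for k in range(1, n + 1)]
    return z, observed, lower, mean, upper

x_demo = [i / 4.0 for i in range(-10, 11)]                        # hypothetical predictor
y_demo = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
z, observed, lower, mean, upper = simulated_envelope(x_demo, y_demo)
```

Step 6 then amounts to plotting `observed`, `lower`, `mean`, and `upper` against `z`; points of `observed` near or beyond the `lower` and `upper` curves would call the fit into question.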

Example

Table 14.10a repeats a portion of the data for the disease outbreak example, as well as the
fitted values for the logistic regression model. It also contains a portion of the simulated
responses for the 19 simulation samples. For instance, the simulated responses for case 1
were obtained by generating Bernoulli random outcomes with probability $\hat{\pi}_1 = .209$.

Table 14.10b shows some of the ordered absolute deviance residuals for the 19 simulation
samples. Finally, Table 14.10c presents the minimum, mean, and maximum for the 19
simulation samples for some of the rank order positions, the ordered absolute deviance
residuals for the original sample for these rank order positions, and corresponding z percentiles. The
results in Table 14.10c are plotted in Figure 14.14. We see clearly from this figure that
the largest deviance residuals (which here correspond to cases 5 and 14) are farthest to the
right and are somewhat separated from the other cases. However, they fall well within the
simulated envelope so that remedial measures do not appear to be required. Figure 14.14
also shows that most of the absolute deviance residuals fall near the simulation means,
suggesting that the logistic regression model is appropriate here.

[Figure 14.14: half-normal probability plot with simulated envelope, disease outbreak example.]

Detection of Influential Observations 

In this section we introduce three measures that can be used to identify influential
observations. We consider the influence of individual binary cases on three aspects of the
analysis:

1. The Pearson chi-square statistic (14.79b).

2. The deviance statistic (14.82a).

3. The fitted linear predictor, $\hat{\pi}_i'$.

As was the case in standard regression situations, we will employ case-deletion diagnostics
to assess the effect of individual cases on the results of the analysis.

Influence on Pearson Chi-Square and the Deviance Statistics. Let $X^2$ and $DEV$ denote
the Pearson and deviance statistics (14.79b) and (14.82a) based on the full data set, and let
$X^2_{(i)}$ and $DEV_{(i)}$ denote the values of these test statistics when case i is deleted. The ith
delta chi-square statistic is defined as the change in the Pearson statistic when the ith case
is deleted:

$$\Delta X_i^2 = X^2 - X^2_{(i)}$$

Similarly, the ith delta deviance statistic is defined as the change in the deviance statistic
when the ith case is deleted:

$$\Delta dev_i = DEV - DEV_{(i)}$$

Determination of the n delta chi-square statistics or the n delta deviance statistics requires
n maximizations of the likelihood, which can be time consuming. For faster computing, the
following one-step approximations have been developed:

$$\Delta X_i^2 = r_{SP_i}^2 \tag{14.85}$$

$$\Delta dev_i = h_{ii} r_{SP_i}^2 + dev_i^2 \tag{14.86}$$

In summary, $\Delta X_i^2$ and $\Delta dev_i$ give the change in the Pearson chi-square and deviance
statistics, respectively, when the ith case is deleted. They therefore provide measures of the
influence of the ith case on these summary statistics.
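
The one-step approximations (14.85) and (14.86) need only quantities already available from the full fit. The sketch below uses hypothetical function names and, as sample input, the first-case values $r_{SP_1} = -.524$, $h_{11} = .039$, and $dev_1 = -.685$ from Table 14.9.

```python
def delta_chi_square(r_sp: float) -> float:
    """One-step approximation (14.85): square of the studentized Pearson residual."""
    return r_sp ** 2

def delta_deviance(r_sp: float, h_ii: float, dev: float) -> float:
    """One-step approximation (14.86): h_ii * r_SP^2 + dev^2."""
    return h_ii * r_sp ** 2 + dev ** 2

r_sp1, h11, dev1 = -0.524, 0.039, -0.685   # first case, Table 14.9

dx2_1 = delta_chi_square(r_sp1)
ddev_1 = delta_deviance(r_sp1, h11, dev1)
print(f"delta chi-square = {dx2_1:.4f}, delta deviance = {ddev_1:.4f}")
```

Because the inputs are rounded to three decimals, the results can differ from tabled values in the last digit.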

Interpretation of the delta chi-square and delta deviance statistics is not always a simple
matter. In standard regression situations, we employ various rules of thumb for judging the
magnitude of a regression diagnostic. An example of this is the Bonferroni outlier test
(Section 10.2) that is used in conjunction with the studentized deleted residual (10.26). Another
is the use of various percentiles of the F distribution for interpretation of Cook's distance
(Section 10.4). Guidelines such as these are generally not available for logistic regression
as the distribution of the delta statistics is unknown except under certain restrictive
assumptions. The judgment as to whether or not a case is outlying or overly influential is typically
made on the basis of a subjective visual assessment of an appropriate graphic. Usually, the
delta chi-square and delta deviance statistics are plotted against case number i, against $\hat{\pi}_i$,
or against $\hat{\pi}_i'$. Extreme values appear as spikes when plotted against case number, or as
outliers in the upper corners of the plot when plotted against $\hat{\pi}_i$ or $\hat{\pi}_i'$.

TABLE 14.11 Pearson Residuals, Studentized Pearson Residuals, Hat Diagonals, Deviance Residuals, Delta
Chi-Square and Delta Deviance Statistics, and Cook's Distance - Disease Outbreak Example.

        (1)       (2)      (3)      (4)       (5)         (6)        (7)
  i     r_Pi     r_SPi    h_ii     dev_i    DeltaX2_i   Deltadev_i   D_i
  1    -0.514   -0.524    .039    -0.685     0.275       0.479      0.002
  2    -0.529   -0.541    .040    -0.703     0.292       0.506      0.002
  3    -0.344   -0.350    .033    -0.473     0.122       0.228      0.001
 ...     ...      ...      ...      ...       ...         ...        ...
 96    -0.358   -0.363    .025    -0.491     0.132       0.245      0.001
 97    -0.318   -0.322    .024    -0.43?     0.104       0.195      0.001
 98    -0.455   -0.463    .036    -0.613     0.214       0.383      0.002


Example

Table 14.11 lists in columns 1-6 for a portion of the disease outbreak data the Pearson
residuals $r_{P_i}$, the studentized Pearson residuals $r_{SP_i}$, the hat matrix diagonal elements $h_{ii}$,
the deviance residuals $dev_i$, the delta chi-square statistics $\Delta X_i^2$, and the delta deviance
statistics $\Delta dev_i$. We illustrate the calculations needed to obtain $\Delta X_i^2$ and $\Delta dev_i$ for the first
case. As noted in (14.85), the first delta chi-square statistic is given by the square of the first
studentized Pearson residual:

$$\Delta X_1^2 = r_{SP_1}^2 = (-.524)^2 = .275$$

Using (14.86) with $h_{11} = .039$ and $dev_1 = -.685$ from columns 3 and 4 of Table 14.11,
the first delta deviance statistic is:

$$\Delta dev_1 = h_{11} r_{SP_1}^2 + dev_1^2 = .039(-.524)^2 + (-.685)^2 = .479$$

Figures 14.15a and 14.15b provide index plots of the delta chi-square and delta deviance
statistics for the disease outbreak example. The two spikes corresponding to cases 5 and 14
indicate clearly that these cases have the largest values of the delta deviance and delta
chi-square statistics. Shown just below each of these in Figures 14.15c and 14.15d are plots of
the delta chi-square and delta deviance statistics against the model-estimated probabilities.
Note that cases 5 and 14 again stand out, this time in the upper left corner of the plot. The
results suggest that cases 5 and 14 may substantively affect the conclusions. The cases were
therefore flagged for potential remedial action at a later stage of the analysis.

FIGURE 14.15 Delta Chi-Square and Delta Deviance Plots - Disease Outbreak Example.
(a) $\Delta X_i^2$ versus i (b) $\Delta dev_i$ versus i (c) $\Delta X_i^2$ versus $\hat{\pi}_i$ (d) $\Delta dev_i$ versus $\hat{\pi}_i$
Horizontal axes: Case for (a) and (b), Estimated Probability for (c) and (d).

Influence on the Fitted Linear Predictor: Cook's Distance. In Chapter 10, we
introduced Cook's distance statistic, $D_i$, for the identification of influential observations. We
noted that for the standard regression case $D_i$ measures the standardized change in the
fitted response vector $\hat{\mathbf{Y}}$ when the ith case is deleted. Similarly, Cook's distance for logistic
regression measures the standardized change in the linear predictor $\hat{\pi}_i'$ when the ith case
is deleted. Like the delta statistics described above, obtaining these values exactly requires
n maximizations of the likelihood. Instead, the following one-step approximation is used
(Reference 14.5):

$$D_i = \frac{r_{P_i}^2 h_{ii}}{p(1 - h_{ii})^2} \tag{14.87}$$
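
Approximation (14.87) is a one-line computation once the Pearson residual and leverage are known. A minimal sketch follows; the function name is hypothetical, and the inputs $r_{P_1} = -.514$, $h_{11} = .039$, and p = 5 are the first-case values used in this chapter's disease outbreak example.

```python
def cooks_distance(r_p: float, h_ii: float, p: int) -> float:
    """One-step approximation (14.87): r_P^2 * h_ii / (p * (1 - h_ii)^2)."""
    return (r_p ** 2) * h_ii / (p * (1.0 - h_ii) ** 2)

d_1 = cooks_distance(r_p=-0.514, h_ii=0.039, p=5)
print(f"D_1 = {d_1:.4f}")
```

Applying the same function to every case yields the values needed for an index plot of $D_i$.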


Index plots of leverage values $h_{ii}$ are useful for identifying outliers in the X space, and
index plots of $D_i$ can be used to identify cases that have a large effect on the fitted linear
predictor. As was the case with the delta chi-square and delta deviance statistics, rules of
thumb for judging the magnitudes of these diagnostics are not available, and we must rely
on a visual assessment of an appropriate graphic. Note that influence on both the deviance
(or Pearson chi-square) statistic and the linear predictor can be assessed simultaneously
using a proportional influence or bubble plot of the delta deviance (or delta chi-square)
statistics, in which the area of the plot symbol is proportional to $D_i$.

Example

Cook's distances are listed in column 7 of Table 14.11 for a portion of the disease outbreak
example. To illustrate the calculation of Cook's distance we again focus on the first case.
We require $h_{11} = .039$, $r_{P_1} = -.514$ from columns 1 and 3 of Table 14.11. Then, we have
from (14.87) with p = 5:

$$D_1 = \frac{r_{P_1}^2 h_{11}}{p(1 - h_{11})^2} = \frac{(-.514)^2(.039)}{5(1 - .039)^2} = .0022$$

Figures 14.16a-c display an index plot of $h_{ii}$, an index plot of $D_i$, and a proportional-influence
plot of the delta deviance statistics. The leverage plot identifies case 48 as being
somewhat outlying in the X space, and therefore potentially influential, and the plot of
Cook's distances indicates that case 48 is indeed the most influential in terms of effect on
the linear predictor. Note that cases 5 and 14, previously identified as most influential in
terms of their effect on the Pearson chi-square and deviance statistics, have relatively less
influence on the linear predictor. This is shown also by the proportional-influence plot in
Figure 14.16c. These two cases, which have the largest delta deviance values, are located
in the upper left region of the plot. The plot symbols for these cases are not overly large,
indicating that these cases are not particularly influential in terms of the fitted linear predictor
values. Case 48 was temporarily deleted and the logistic regression fit was obtained (not
shown). The results were not appreciably different from those obtained from the full data
set, and the case was retained.

FIGURE 14.16 Index Plots of Leverage Values, Cook's Distances, and Proportional-Influence Plot of Delta
Deviance Statistic - Disease Outbreak Example.
(a) $h_{ii}$ versus i (b) $D_i$ versus i (c) Proportional-Influence Plot
Horizontal axes: Case Index for (a) and (b), Estimated Probability for (c). Case 48 is labeled in panels (a) and (b).



14.9 Inferences about Mean Response

Frequently, estimation of the probability $\pi$ for one or several different sets of values of the
predictor variables is required. In the disease outbreak example, for instance, there may be
interest in the probability of 10-year-old persons of lower socioeconomic status living in
city sector 1 having contracted the disease.


Point Estimator 

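As usual, we denote the vector of the levels of the X variables for which $\pi$ is to be estimated
by $\mathbf{X}_h$:

$$\mathbf{X}_h = \begin{bmatrix} 1 \\ X_{h1} \\ X_{h2} \\ \vdots \\ X_{h,p-1} \end{bmatrix} \tag{14.88}$$

and the mean response of interest by $\pi_h$:

$$\pi_h = [1 + \exp(-\mathbf{X}_h'\boldsymbol{\beta})]^{-1} \tag{14.89}$$

The point estimator of $\pi_h$ will be denoted by $\hat{\pi}_h$ and is as follows:

$$\hat{\pi}_h = [1 + \exp(-\mathbf{X}_h'\mathbf{b})]^{-1} \tag{14.90}$$

where $\mathbf{b}$ is the vector of estimated regression coefficients in (14.43).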


Interval Estimation 

We obtain a confidence interval for $\pi_h$ in two stages. First, we calculate confidence limits
for the logit mean response $\pi_h'$. Then we use the relation (14.38a) to obtain confidence limits
for the mean response $\pi_h$. To see this clearly, we consider (14.38a) for $\mathbf{X} = \mathbf{X}_h$:

$$E\{Y_h\} = [1 + \exp(-\mathbf{X}_h'\boldsymbol{\beta})]^{-1}$$

and restate the expression by using the fact that $E\{Y_h\} = \pi_h$ and $\mathbf{X}_h'\boldsymbol{\beta} = \pi_h'$:

$$\pi_h = [1 + \exp(-\pi_h')]^{-1} \tag{14.91}$$

It is this relation in (14.91) that we utilize to convert confidence limits for $\pi_h'$ into confidence
limits for $\pi_h$.

The point estimator of the logit mean response $\pi_h' = \mathbf{X}_h'\boldsymbol{\beta}$ is $\hat{\pi}_h' = \mathbf{X}_h'\mathbf{b}$. The estimated
approximate variance of $\hat{\pi}_h' = \mathbf{X}_h'\mathbf{b}$ according to (5.46) is:

$$s^2\{\hat{\pi}_h'\} = s^2\{\mathbf{X}_h'\mathbf{b}\} = \mathbf{X}_h' s^2\{\mathbf{b}\} \mathbf{X}_h \tag{14.92}$$

where $s^2\{\mathbf{b}\}$ is the estimated approximate variance-covariance matrix of the regression
coefficients in (14.51) when n is large.

Approximate $1 - \alpha$ large-sample confidence limits for the logit mean response $\pi_h'$ are
then obtained in the usual fashion:

$$L = \hat{\pi}_h' - z(1 - \alpha/2)s\{\hat{\pi}_h'\} \tag{14.93a}$$

$$U = \hat{\pi}_h' + z(1 - \alpha/2)s\{\hat{\pi}_h'\} \tag{14.93b}$$

Here, L and U are, respectively, the lower and upper confidence limits for $\pi_h'$.



Finally, we use the monotonic relation between $\pi_h$ and $\pi_h'$ in (14.91) to convert the
confidence limits L and U for $\pi_h'$ into approximate $1 - \alpha$ confidence limits $L^*$ and $U^*$ for
the mean response $\pi_h$:

$$L^* = [1 + \exp(-L)]^{-1} \tag{14.94a}$$

$$U^* = [1 + \exp(-U)]^{-1} \tag{14.94b}$$

Simultaneous Confidence Intervals for Several Mean Responses

When it is desired to estimate several mean responses $\pi_h$ corresponding to different $\mathbf{X}_h$
vectors with family confidence coefficient $1 - \alpha$, Bonferroni simultaneous confidence
intervals may be used. The procedure for g confidence intervals is the same as that for a single
confidence interval except that $z(1 - \alpha/2)$ in (14.93) is replaced by $z(1 - \alpha/2g)$.



In the disease outbreak example of Table 14.3, it is desired to find an approximate 95 percent
confidence interval for the probability $\pi_h$ that persons 10 years old who are of lower
socioeconomic status and live in sector 1 have contracted the disease. The vector $\mathbf{X}_h$ in (14.88)
here is:

$$\mathbf{X}_h = \begin{bmatrix} 1 \\ 10 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

Using the results in Table 14.4a, we obtain the point estimate of the logit mean response:

$$\hat{\pi}_h' = \mathbf{X}_h'\mathbf{b} = -2.3129(1) + .02975(10) + .4088(0) - .30525(1) + 1.5747(0) = -2.32065$$

The estimated variance of $\hat{\pi}_h'$ is obtained by using (14.92) (calculations not shown):

$$s^2\{\hat{\pi}_h'\} = .2945$$

so that $s\{\hat{\pi}_h'\} = .54268$. For $1 - \alpha = .95$, we require $z(.975) = 1.960$. Hence, the
confidence limits for the logit mean response $\pi_h'$ are according to (14.93):

$$L = -2.32065 - 1.960(.54268) = -3.38430$$
$$U = -2.32065 + 1.960(.54268) = -1.25700$$

Finally, we use (14.94) to obtain the confidence limits for the mean response $\pi_h$:

$$L^* = [1 + \exp(3.38430)]^{-1} = .033$$
$$U^* = [1 + \exp(1.25700)]^{-1} = .22$$

Thus, the approximate 95 percent confidence interval for the mean response $\pi_h$ is:

$$.033 < \pi_h < .22$$

We therefore find, with approximate 95 percent confidence, that the probability is between
.033 and .22 that 10-year-old persons of lower socioeconomic status who live in sector 1 have
contracted the disease. This confidence interval is useful for indicating that persons with
the specified characteristics are not subject to a very high probability of having contracted
the disease, but the confidence interval is quite wide and thus not precise.
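
The two-stage calculation in (14.93) and (14.94) can be reproduced with the standard library. In the sketch below the function names are hypothetical; the coefficients, the vector of predictor levels, and the variance .2945 are the values quoted in the example above. The optional argument `g` applies the Bonferroni adjustment described for simultaneous intervals.

```python
import math
from statistics import NormalDist

def logistic(eta: float) -> float:
    """Inverse logit, relation (14.91)."""
    return 1.0 / (1.0 + math.exp(-eta))

def mean_response_interval(b, x_h, var_logit, alpha=0.05, g=1):
    """Return (point estimate, L*, U*) for the mean response at x_h.

    var_logit is s^2{pi_hat'_h} from (14.92); g > 1 gives Bonferroni limits.
    """
    logit_hat = sum(bi * xi for bi, xi in zip(b, x_h))    # X_h' b
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * g))     # z(1 - alpha/2g)
    s = math.sqrt(var_logit)
    lower_logit = logit_hat - z * s                       # (14.93a)
    upper_logit = logit_hat + z * s                       # (14.93b)
    return logistic(logit_hat), logistic(lower_logit), logistic(upper_logit)

b = [-2.3129, 0.02975, 0.4088, -0.30525, 1.5747]  # coefficients quoted in the example
x_h = [1, 10, 0, 1, 0]                            # levels of the X variables

point, lower, upper = mean_response_interval(b, x_h, var_logit=0.2945)
print(f"estimate {point:.3f}, interval ({lower:.3f}, {upper:.3f})")
```

Because the limits are transformed through the nonlinear function (14.91), the resulting interval is not centered on the point estimate.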



Comment

The confidence limits for $\pi_h$ in (14.94) are not symmetric around the point estimate. In the disease
outbreak example, for instance, the point estimate is:

$$\hat{\pi}_h = [1 + \exp(2.32065)]^{-1} = .089$$

while the confidence limits are .033 and .22. The reason for the asymmetry is that $\hat{\pi}_h$ is not a linear
function of $\hat{\pi}_h'$.

14.10 Prediction of a New Observation


Multiple logistic regression is frequently employed for making predictions for new
observations. In one application, for example, health personnel wished to predict whether a certain
surgical procedure will ameliorate a new patient's condition, given the patient's age,
gender, and various symptoms. In another application, marketing officials of a computer firm
wished to predict whether a retail chain will purchase a new computer, on the basis of the
age of the company's current computer, the company's current workload, and other factors.

Choice of Prediction Rule 

Forecasting a binary outcome for given levels $\mathbf{X}_h$ of the X variables is simple in the sense
that the outcome 1 will be predicted if the estimated value $\hat{\pi}_h$ is large, and the outcome 0
will be predicted if $\hat{\pi}_h$ is small. The difficulty in making predictions of a binary outcome is
in determining the cutoff point, below which the outcome 0 is predicted and above which
the outcome 1 is predicted. A variety of approaches are possible to determine where this
cutoff point is to be located. We consider three approaches.

1. Use .5 as the cutoff. With this approach, the prediction rule is:

If $\hat{\pi}_h$ exceeds .5, predict 1; otherwise predict 0.

This approach is reasonable when (a) it is equally likely in the population of interest that
outcomes 0 and 1 will occur; and (b) the costs of incorrectly predicting 0 and 1 are
approximately the same.

2. Find the best cutoff for the data set on which the multiple logistic regression model
is based. This approach involves evaluating different cutoffs. For each cutoff, the rule is
employed on the n cases in the model-building data set and the proportion of cases incorrectly
predicted is ascertained. The cutoff for which the proportion of incorrect predictions is lowest
is the one to be employed.

This approach is reasonable when (a) the data set is a random sample from the relevant
population, and thus reflects the proper proportions of 0s and 1s in the population, and
(b) the costs of incorrectly predicting 0 and 1 are approximately the same. The proportion
of incorrect predictions observed for the optimal cutoff is likely to be an overstatement
of the ability of the cutoff to correctly predict new observations, especially if the
model-building data set is not large. The reason is that the cutoff is chosen with reference to the
same data set from which the logistic model was fitted and thus is best for these data only.
Consequently, as we explained in Chapter 9, it is important that a validation data set be
employed to indicate whether the observed predictive ability for a fitted regression model
is a valid indicator for predicting new observations.





3. Use prior probabilities and costs of incorrect predictions in determining the cutoff. 
When prior information is available about the likelihood of 1s and 0s in the population
and the data set is not a random sample from the population, the prior information can 
be used in finding an optimal cutoff. In addition, when the cost of incorrectly predicting 
outcome 1 differs substantially from the cost of incorrectly predicting outcome 0, these costs 
of incorrect consequences can be incorporated into the determination of the cutoff so that 
the expected cost of incorrect predictions will be minimized. Specialized references, such 
as Reference 14.7, discuss the use of prior information and costs of incorrect predictions 
for determining the optimal cutoff. 


Example

We shall use the disease outbreak example of Table 14.3 to illustrate how to obtain the cutoff
point for predicting a new observation, even though the main purpose of that study was to
determine whether age, socioeconomic status, and city sector are important risk factors.
We assume that the cost of incorrectly predicting that a person has contracted the disease
is about the same as the cost of incorrectly predicting that a person has not contracted the
disease. The estimated logistic response function is given in (14.46).

Since a random sample of individuals was selected in the two city sectors, the 98 cases in 
the study constitute a cross section of the relevant population. Consequently, information is 
provided in the sample about the proportion of persons who have contracted the disease in 
the population. Of the 98 persons in the study, 31 had contracted the disease (see the disease 
outbreak data set in Appendix CIO); hence the estimated proportion of persons who had 
contracted the disease is 31/98 = .316. This proportion can be used as the starting point in 
the search for the best cutoff in the prediction rule. 

Thus, the first rule investigated was: 

Predict l ifrt h > .316; predict 0 if rt h < .316 (14.95) 

Note from Table 14.3, column 6, that $\hat{\pi}_1 = .209$ for case 1; hence prediction rule (14.95)
calls for a prediction that the person has not contracted the disease. This would be a correct
prediction. Similarly, prediction rule (14.95) would correctly predict cases 2 and 3 not to
have contracted the disease. However, the prediction with rule (14.95) for case 4 (person has
contracted the disease because $\hat{\pi}_4 = .371 > .316$) would be incorrect. Similarly, the prediction
for case 5 (person has not contracted the disease because $\hat{\pi}_5 = .111 < .316$) would be
incorrect. Table 14.12a provides a summary of the number of correct and incorrect classifications
based on prediction rule (14.95). Of the 67 persons without the disease, 20 would
be incorrectly predicted to have contracted the disease, or an error rate of 29.9 percent.


TABLE 14.12 Classification Based on Logistic Response Function (14.46) and Prediction Rules
(14.95) and (14.96): Disease Outbreak Example.

                             (a) Rule (14.95)                      (b) Rule (14.96)
True Classification    $\hat{Y} = 0$   $\hat{Y} = 1$   Total     $\hat{Y} = 0$   $\hat{Y} = 1$   Total
$Y = 0$                     47              20          67            50              17          67
$Y = 1$                      8              23          31             9              22          31
Total                       55              43          98            59              39          98




Of the 31 persons with the disease, eight would be incorrectly predicted with rule (14.95)
not to have contracted the disease, or 25.8 percent. Altogether, 20 + 8 = 28 of the 98 predictions
would be incorrect, so that the prediction error rate for rule (14.95) is 28/98 = .286 or
28.6 percent.

Similar analyses were made for other cutoff points and it appears that among the cutoffs 
considered, use of the following rule may be best: 


Predict 1 if $\hat{\pi}_h \geq .325$; predict 0 if $\hat{\pi}_h < .325$    (14.96)


Table 14.12b provides a summary of the correct and incorrect classifications based on 
prediction rule (14.96). The prediction error rate for this rule is (9 + 17)/98 = .265 or
26.5 percent. Note also that for this rule, the error rates for persons with and without the 
disease (9/31 and 17/67) are quite close to each other. Thus, the risks of incorrect predictions 
for the two groups are fairly balanced, which is often desirable. Note also that the error 
rates for persons with and without the disease are much less balanced as the cutoff is shifted 
further away from the optimal one in either direction. 
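
The error rates quoted for the two prediction rules can be recomputed directly from the counts in Table 14.12. The following is a minimal sketch; the function name `error_rates` and its argument names are illustrative assumptions, not part of the original analysis.

```python
def error_rates(true0_pred0, true0_pred1, true1_pred0, true1_pred1):
    """Error rates from a 2 x 2 classification table.

    Arguments are counts: true class (0 or 1) crossed with predicted class.
    """
    n0 = true0_pred0 + true0_pred1  # persons without the disease
    n1 = true1_pred0 + true1_pred1  # persons with the disease
    return {
        "without_disease": true0_pred1 / n0,  # predicted 1 although Y = 0
        "with_disease": true1_pred0 / n1,     # predicted 0 although Y = 1
        "total": (true0_pred1 + true1_pred0) / (n0 + n1),
    }

# Counts taken from Table 14.12a and Table 14.12b.
rule_14_95 = error_rates(47, 20, 8, 23)
rule_14_96 = error_rates(50, 17, 9, 22)

for name, rates in (("(14.95)", rule_14_95), ("(14.96)", rule_14_96)):
    print(name, {k: round(v, 3) for k, v in rates.items()})
```

The same function can be applied to any other cutoff once its classification table has been tabulated, which makes it easy to compare how balanced the two group error rates are.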

An effective way to display this information graphically is through the receiver operating
characteristic (ROC) curve, which plots $P(\hat{Y} = 1 \mid Y = 1)$ (also called sensitivity) as
a function of $1 - P(\hat{Y} = 0 \mid Y = 0)$ (also called 1 - specificity) for the possible cutpoints.
Figure 14.17 exhibits the ROC curve for model (14.46) for all possible cutpoints between
0 and 1. (See A.7a for the definition of conditional probability.)

To see how a single point on the ROC curve in Figure 14.17 is determined, we consider
rule (14.95), for which the cutoff is .316. From Table 14.12a, the sensitivity is:

$P(\hat{Y} = 1 \mid Y = 1) = \frac{23}{31} = .742$




FIGURE 14.17 JMP ROC Curve: Disease Outbreak Example.

[Receiver Operating Characteristic Curve. Vertical axis: sensitivity, 0.00 to 1.00. Horizontal axis: 1 - Specificity (False Positive), 0.00 to 1.00. Using Y = '1' to be the positive level. Area Under Curve = 0.77684.]





Also, 1 - specificity here is:

$1 - P(\hat{Y} = 0 \mid Y = 0) = 1 - \frac{47}{67} = .30$

This point is highlighted on the ROC curve in Figure 14.17. 

The area under the ROC curve is a useful summary measure of the model's predictive
power and is identical to the concordance index. Consider any pair of observations $(i, j)$
such that $Y_i = 1$ and $Y_j = 0$. Since $Y_i > Y_j$, this pair is said to be concordant if $\hat{\pi}_i > \hat{\pi}_j$.
The concordance index estimates the probability that the predictions and the outcomes are
concordant (Reference 14.2). A value of 0.5 means that the predictions were no better than
random guessing. For the disease outbreak model (14.46), the ROC area is 0.777.
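
The equality between the ROC area and the concordance index can be checked numerically. The sketch below uses a small hypothetical set of outcomes and estimated probabilities (not the disease outbreak data); the function names are illustrative, and tied pairs are counted as one half, which is one common convention and an assumption here.

```python
def concordance_index(y, p):
    """Share of (Y = 1, Y = 0) pairs in which the Y = 1 case has the larger
    estimated probability; tied pairs count one half."""
    ones = [pi for yi, pi in zip(y, p) if yi == 1]
    zeros = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for p1 in ones:
        for p0 in zeros:
            if p1 > p0:
                score += 1.0
            elif p1 == p0:
                score += 0.5
    return score / (len(ones) * len(zeros))


def roc_points(y, p):
    """(1 - specificity, sensitivity) for the rule 'predict 1 if p >= cutoff',
    evaluated at every distinct estimated probability."""
    n1 = sum(y)
    n0 = len(y) - n1
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(p), reverse=True):
        tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= cutoff)
        fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= cutoff)
        points.append((fp / n0, tp / n1))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical outcomes and estimated probabilities.
y = [1, 1, 1, 0, 0, 0, 0]
p = [0.80, 0.45, 0.30, 0.60, 0.30, 0.20, 0.10]

c_index = concordance_index(y, p)
auc = area_under_curve(roc_points(y, p))
print(round(c_index, 4), round(auc, 4))
```

Each point returned by `roc_points` corresponds to one cutoff, in the same way that the point for rule (14.95) is obtained from Table 14.12a.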

A validation study will now be required to determine whether the observed prediction 
error rate for the optimal cutoff properly indicates the risks of incorrect predictions for new 
observations, or whether it seriously understates them. In any case, it appears already that 
fitted logistic regression model (14.46) may not be too useful as a predictive model because
of the relatively high risks of making incorrect predictions. 


Comment 

A limitation of the prediction rule approach is that it dichotomizes a continuous predictor $\hat{\pi}$, where
the choice of cutpoint for $\hat{\pi}_h$ is arbitrary and is highly dependent upon the relative frequencies of 1s and
0s observed in the sample. ■

Validation of Prediction Error Rate

The reliability of the prediction error rate observed in the model-building data set is examined
by applying the chosen prediction rule to a validation data set. If the new prediction
error rate is about the same as that for the model-building data set, then the latter gives a 
reliable indication of the predictive ability of the fitted logistic regression model and the 
chosen prediction rule. If the new data lead to a considerably higher prediction error rate, 
then the fitted logistic regression model and the chosen prediction rule do not predict new 
observations as well as originally indicated. 

Example

In the disease outbreak example, the fitted logistic regression function (14.46) based on the
model-building data set:

$\hat{\pi} = [1 + \exp(-3.8877 - .02975X_1 - .4088X_2 + .30525X_3 - 1.5747X_4)]^{-1}$

was used to calculate estimated probabilities $\hat{\pi}_h$ for cases 99-196 in the disease outbreak data
set in Appendix C.10. These cases constitute the validation data set. The chosen prediction
rule (14.96):

Predict 1 if $\hat{\pi}_h \geq .325$; predict 0 if $\hat{\pi}_h < .325$





was then applied to these estimated probabilities. The percent prediction error rates were 
as follows: 


                  Disease Status
With Disease    Without Disease    Total
    46.2             38.9           40.8


Note that the total prediction error rate of 40.8 percent is considerably higher than the 
26.5 percent error rate based on the model-building data set. The latter therefore is not a
reliable indicator of the predictive capability of the fitted logistic regression model and the 
chosen prediction rule. 

We should mention again that making predictions was not the primary objective in the 
disease outbreak study. Rather, the main purpose was to identify key explanatory variables. 
Still, the prediction error rate for the validation data set shows that there must be other key 
explanatory variables affecting whether a person has contracted the disease that have not 
yet been identified for inclusion in the logistic regression model. 


Comment 

An alternative to multiple logistic regression for predicting a binary response variable when the 
predictor variables are continuous is discriminant analysis. This approach assumes that the predictor 
variables follow a joint multivariate normal distribution. Discriminant analysis can also be used when
this condition is not met, but the approach is not optimal then and logistic regression frequently
is preferable. The reader is referred to Reference 14.8 for an in-depth discussion of discriminant
analysis. ■


14.11 Polytomous Logistic Regression for Nominal Response

Logistic regression is most frequently used to model the relationship between a dichotomous
response variable and a set of predictor variables. On occasion, however, the response
variable may have more than two levels. Logistic regression can still be employed by
means of a polytomous, or multicategory, logistic regression model. Polytomous logistic
regression models are used in many fields. In business, for instance, a market researcher
may wish to relate a consumer's choice of product (product A, product B, product C) to
the consumer's age, gender, geographic location, and several other potential explanatory
variables. This is an example of nominal polytomous regression, because the response
categories are purely qualitative and not ordered in any way. Ordinal response categories can
also be modeled using polytomous regression. For example, the relation between severity
of disease measured on an ordinal scale (mild, moderate, severe) and age of patient, gender
of patient, and some other explanatory variables may be of interest. We consider ordinal
polytomous logistic regression in detail in Section 14.12.

In this section we discuss the use of polytomous logistic regression for nominal multicategory
responses. Throughout, we will use the pregnancy duration example, introduced
in Section 14.2 in the context of binary logistic regression, to illustrate concepts. This time,
however, the response will have more than two categories.





   (1)        (2)        (3)        (4)          (5)           (6)        (7)          (8)           (9)
 Duration        Response Category           Nutritional       Age Category        Alcohol Use     Smoking
                                               Status                                History        History
  $Y_i$     $Y_{i1}$   $Y_{i2}$   $Y_{i3}$    $X_{i1}$      $X_{i2}$   $X_{i3}$     $X_{i4}$      $X_{i5}$
    1          1          0          0          150            0          0                           1
    1          1          0          0          124            1          0                           0
    1          1          0          0          128            0          0                           1
   ...
    3          0          0          1          117
    3          0          0          1          165
    3          0          0          1          134

Pregnancy Duration Data with Polytomous Response

A study was undertaken to determine the strength of association between several risk factors 
and the duration of pregnancies. The risk factors considered were mother’s age, nutritional 
status, history of tobacco use, and history of alcohol use. The response of interest, pregnancy 
duration, is a three-category variable that was coded as follows:

$Y_i$    Pregnancy Duration Category
  1      Preterm (less than 36 weeks)
  2      Intermediate term (36 to 37 weeks)
  3      Full term (38 weeks or greater)


Relevant data for 102 women who had recently given birth at a large metropolitan hospital
were obtained. A portion of these data is displayed in Table 14.13. The polytomous response,
pregnancy duration ($Y$), is shown in column 1. Nutritional status ($X_1$), shown in column 5, is
an index of nutritional status (higher score denotes better nutritional status). The predictor
variable age was categorized into three groups: 20 years of age or less (coded 1), from 21
to 30 years of age (coded 2), and greater than 30 years of age (coded 3). It is represented by
two indicator variables ($X_2$ and $X_3$), shown in columns 6 and 7 of Table 14.13, as follows:


Class                                    $X_2$    $X_3$
Less than or equal to 20 years of age      1        0
21 to 30 years of age                      0        0
Greater than 30 years of age               0        1


(The researchers chose the middle category, 21 to 30 years of age, as the referent category
for this qualitative predictor because mothers in this age group tend to have the lowest risk
of preterm deliveries. This leads to positive regression coefficients for these predictors, and
a slightly simpler interpretation.) Alcohol and smoking history were also qualitative predictors;
the categories were "Yes" (coded 1) and "No" (coded 0). Alcohol use history ($X_4$)
and smoking history ($X_5$) are listed in columns 8 and 9 of Table 14.13.

TABLE 14.13 Data: Pregnancy Duration Example with Polytomous Response.


Because pregnancy duration is a qualitative variable with three categories, we will create
three binary response variables, one for each response category, as follows:

$$Y_{i1} = \begin{cases} 1 & \text{if case } i \text{ response is category 1} \\ 0 & \text{otherwise} \end{cases}$$

$$Y_{i2} = \begin{cases} 1 & \text{if case } i \text{ response is category 2} \\ 0 & \text{otherwise} \end{cases}$$

$$Y_{i3} = \begin{cases} 1 & \text{if case } i \text{ response is category 3} \\ 0 & \text{otherwise} \end{cases}$$

These three coded variables are also included in Table 14.13 in columns 2, 3, and 4. Note
that because $Y_{i1} + Y_{i2} + Y_{i3} = 1$, the value of any one of these three binary variables can
be determined from the other two. For example, $Y_{i3} = 1 - Y_{i1} - Y_{i2}$.

We first treat pregnancy duration as a nominal response, ignoring the time-based ordering
of the categories; later we will show how a more parsimonious model results when we treat
pregnancy duration as an ordinal response.


$J - 1$ Baseline-Category Logits for Nominal Response

In general, we will assume there are $J$ response categories. Then for the $i$th observation,
there will be $J$ binary response variables, $Y_{i1}, \ldots, Y_{iJ}$, where:

$$Y_{ij} = \begin{cases} 1 & \text{if case } i \text{ response is category } j \\ 0 & \text{otherwise} \end{cases}$$

Since only one category can be selected for response $i$, we have:

$$\sum_{j=1}^{J} Y_{ij} = 1$$

We will require some additional notation for the multicategory case. First, let $\pi_{ij}$ denote
the probability that category $j$ is selected for the $i$th response. Then:

$$\pi_{ij} = P(Y_{ij} = 1)$$

In the binary case, $J = 2$. Suppose that we code $Y_i = 1$ if the $i$th response is category 1,
and we code $Y_i = 0$ if the $i$th response is category 2. Then:

$$\pi_i = \pi_{i1} \quad \text{and} \quad 1 - \pi_i = \pi_{i2}$$

For binary logistic regression, we model the logit of $\pi_i$ using the linear predictor. Since
there are only two categories in binary logistic regression, the logit in fact compares the
probability of a category-1 response to the probability of a category-2 response:

$$\pi_i' = \log_e\left[\frac{\pi_i}{1 - \pi_i}\right] = \log_e\left[\frac{\pi_{i1}}{\pi_{i2}}\right] = \pi_{i12}' = \mathbf{X}_i'\boldsymbol{\beta}_{12}$$

Note that we have used $\pi_{i12}'$ and $\boldsymbol{\beta}_{12}$ to emphasize that the linear predictor is modeling the
logarithm of the ratio of the probabilities for categories 1 and 2.

Now for the $J$ polytomous categories, there are $J(J - 1)/2$ pairs of categories, and
therefore $J(J - 1)/2$ linear predictors. For example, for the pregnancy duration data,
$J = 3$ and we have $3(3 - 1)/2 = 3$ comparisons:

$$\pi_{i12}' = \log_e\left[\frac{\pi_{i1}}{\pi_{i2}}\right] = \mathbf{X}_i'\boldsymbol{\beta}_{12}$$

$$\pi_{i13}' = \log_e\left[\frac{\pi_{i1}}{\pi_{i3}}\right] = \mathbf{X}_i'\boldsymbol{\beta}_{13}$$

$$\pi_{i23}' = \log_e\left[\frac{\pi_{i2}}{\pi_{i3}}\right] = \mathbf{X}_i'\boldsymbol{\beta}_{23}$$

Fortunately, it is not necessary to develop all $J(J - 1)/2$ logistic regression models. One
category will be chosen as the baseline or referent category, and then all other categories will
be compared to it. The choice of baseline or referent category is arbitrary. Frequently the
last category is chosen and, indeed, this is usually the default choice for statistical software
programs. One exception to this may be found in epidemiological studies, where the category
having the lowest risk is often used as the referent category.

Using category $J$ to denote the baseline category, we need consider only the $J - 1$
comparisons to this referent category. The logit for the $j$th such comparison is:

$$\pi_{ijJ}' = \log_e\left[\frac{\pi_{ij}}{\pi_{iJ}}\right] = \mathbf{X}_i'\boldsymbol{\beta}_{jJ} \qquad j = 1, 2, \ldots, J - 1 \qquad (14.97a)$$

Since it is understood that comparisons are always made to category $J$, we let $\pi_{ij}' = \pi_{ijJ}'$
and $\boldsymbol{\beta}_j = \boldsymbol{\beta}_{jJ}$ in (14.97a), giving:

$$\pi_{ij}' = \log_e\left[\frac{\pi_{ij}}{\pi_{iJ}}\right] = \mathbf{X}_i'\boldsymbol{\beta}_j \qquad j = 1, 2, \ldots, J - 1 \qquad (14.97b)$$


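The reason that we need to consider only these $J - 1$ logits is that the logits for any
other comparisons can be obtained from them. To see this, suppose $J = 4$, and we wish to
compare categories 1 and 2. Then:

$$\log_e\left[\frac{\pi_{i1}}{\pi_{i2}}\right] = \log_e\left[\frac{\pi_{i1}}{\pi_{i4}} \times \frac{\pi_{i4}}{\pi_{i2}}\right] = \log_e\left[\frac{\pi_{i1}}{\pi_{i4}}\right] - \log_e\left[\frac{\pi_{i2}}{\pi_{i4}}\right] = \mathbf{X}_i'\boldsymbol{\beta}_1 - \mathbf{X}_i'\boldsymbol{\beta}_2$$

In general, to compare categories $k$ and $l$, we have:

$$\log_e\left[\frac{\pi_{ik}}{\pi_{il}}\right] = \mathbf{X}_i'(\boldsymbol{\beta}_k - \boldsymbol{\beta}_l) \qquad (14.98)$$

Given the $J - 1$ logit expressions in (14.98) it is possible (algebra not shown) to obtain
the $J - 1$ direct expressions for the category probabilities in terms of the $J - 1$ linear
predictors, $\mathbf{X}_i'\boldsymbol{\beta}_j$. The resulting expressions are:

$$\pi_{ij} = \frac{\exp(\mathbf{X}_i'\boldsymbol{\beta}_j)}{1 + \sum_{k=1}^{J-1}\exp(\mathbf{X}_i'\boldsymbol{\beta}_k)} \qquad j = 1, 2, \ldots, J - 1 \qquad (14.99)$$

We next consider methods for obtaining estimates of the $J - 1$ parameter vectors
$\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_{J-1}$.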
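
As an illustration of how the category probabilities follow from the $J - 1$ linear predictors, the sketch below evaluates the baseline-category expressions for $J = 3$. The coefficient vectors and the predictor vector are hypothetical values chosen only for the demonstration, and the function name is an assumption.

```python
import math


def baseline_category_probabilities(x, betas):
    """Return [pi_1, ..., pi_J] given the J - 1 coefficient vectors in
    `betas`; the last category J is the baseline (linear predictor 0)."""
    etas = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    probs = [math.exp(eta) / denom for eta in etas]
    probs.append(1.0 / denom)  # baseline category J
    return probs


# Hypothetical values: x = (1, X_1) and two coefficient vectors (J = 3).
x = [1.0, 10.0]
betas = [[0.5, -0.02], [-0.4, 0.03]]

probs = baseline_category_probabilities(x, betas)
print([round(p, 4) for p in probs])

# Logit for category 1 versus baseline, and for category 1 versus category 2.
logit_1_vs_3 = math.log(probs[0] / probs[2])
logit_1_vs_2 = math.log(probs[0] / probs[1])
print(round(logit_1_vs_3, 4), round(logit_1_vs_2, 4))
```

The last two printed values reproduce the linear predictor for category 1 and the difference of the two linear predictors, which is the content of the general comparison in (14.98).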





Maximum Likelihood Estimation 


There are two approaches commonly used for obtaining estimates of the parameter vectors
$\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_{J-1}$; both employ maximum likelihood estimation. With the first approach, separate
binary logistic regressions are carried out for each of the $J - 1$ comparisons to the baseline
category. For example, to estimate $\boldsymbol{\beta}_1$, we drop from the data set all cases except those for
which either $Y_{i1} = 1$ or $Y_{iJ} = 1$. Since only two categories are then present, we can apply
binary logistic regression directly. This approach is particularly useful when statistical
software is not available for multicategory logistic regression (Reference 14.9).


A more effective approach from a statistical viewpoint is to obtain estimates of the
$J - 1$ logits simultaneously. To do so, we require the likelihood for the full data set. To fix
ideas, suppose that there are $J = 4$ categories and that the third category is selected for the
$i$th response. That is, for case $i$ we have:

$$Y_{i1} = 0 \qquad Y_{i2} = 0 \qquad Y_{i3} = 1 \qquad Y_{i4} = 0$$

The probability of this response is:

$$P(Y_i = 3) = \pi_{i3} = [\pi_{i1}]^0 \times [\pi_{i2}]^0 \times [\pi_{i3}]^1 \times [\pi_{i4}]^0 = \prod_{j=1}^{4}[\pi_{ij}]^{Y_{ij}}$$


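For $n$ independent observations and $J$ categories, it is easily seen that the likelihood is:

$$P(Y_1, \ldots, Y_n) = \prod_{i=1}^{n} P(Y_i) = \prod_{i=1}^{n}\left[\prod_{j=1}^{J}[\pi_{ij}]^{Y_{ij}}\right] \qquad (14.100)$$

It can be shown that the log likelihood is given by:

$$\log_e[P(Y_1, \ldots, Y_n)] = \sum_{i=1}^{n}\left[\sum_{j=1}^{J-1}(Y_{ij}\mathbf{X}_i'\boldsymbol{\beta}_j) - \log_e\left(1 + \sum_{j=1}^{J-1}\exp(\mathbf{X}_i'\boldsymbol{\beta}_j)\right)\right] \qquad (14.101)$$

The maximum likelihood estimates of $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_{J-1}$ are those values $\mathbf{b}_1, \ldots, \mathbf{b}_{J-1}$ that
maximize (14.101). As usual, we will rely on standard statistical software programs to
obtain these estimates.

As was the case for binary logistic regression, the $J - 1$ fitted response functions may be
obtained by substituting the maximum likelihood estimates of the $J - 1$ parameter vectors
into the expression in (14.99):

$$\hat{\pi}_{ij} = \frac{\exp(\mathbf{X}_i'\mathbf{b}_j)}{1 + \sum_{k=1}^{J-1}\exp(\mathbf{X}_i'\mathbf{b}_k)}$$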
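
To make the log likelihood in (14.101) concrete, the following sketch evaluates it for a tiny hypothetical data set and checks it against the direct form $\sum_i \sum_j Y_{ij}\log_e \pi_{ij}$ implied by (14.100). The data, coefficient values, and function names are assumptions made for illustration; in practice the maximization is left to statistical software.

```python
import math


def linear_predictors(x, betas):
    """X_i'beta_j for each of the J - 1 non-baseline categories."""
    return [sum(b * v for b, v in zip(beta, x)) for beta in betas]


def log_likelihood(X, Y, betas):
    """Log likelihood in the form of (14.101). Each row of Y holds the J
    indicators; zip() pairs only the first J - 1 with the linear predictors."""
    total = 0.0
    for x, y in zip(X, Y):
        etas = linear_predictors(x, betas)
        total += sum(yj * eta for yj, eta in zip(y, etas))
        total -= math.log(1.0 + sum(math.exp(eta) for eta in etas))
    return total


def direct_log_likelihood(X, Y, betas):
    """Logarithm of (14.100): sum over cases and categories of Y_ij * log(pi_ij)."""
    total = 0.0
    for x, y in zip(X, Y):
        etas = linear_predictors(x, betas) + [0.0]  # baseline category J
        denom = sum(math.exp(eta) for eta in etas)
        total += sum(yj * math.log(math.exp(eta) / denom)
                     for yj, eta in zip(y, etas))
    return total


# Hypothetical data: three cases, x = (1, X_1), J = 3 categories.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
Y = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
betas = [[0.2, -0.3], [-0.1, 0.4]]

ll = log_likelihood(X, Y, betas)
ll_direct = direct_log_likelihood(X, Y, betas)
print(round(ll, 6), round(ll_direct, 6))
```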

We turn now to an example to illustrate the analysis and interpretation of a nominal-level 
polytomous logistic regression model. 









For the pregnancy duration data in Table 14.13, a set of $J - 1 = 2$ first-order linear predictors
was initially proposed:

$$\log_e\left[\frac{\pi_{ij}}{\pi_{i3}}\right] = \mathbf{X}_i'\boldsymbol{\beta}_j \qquad \text{for } j = 1, 2$$

MINITAB's nominal logistic regression output is displayed in Figure 14.18. It first indicates
that the response had three levels, 1, 2, and 3, and that the referent response event is
$Y_i = 3$. Following this summary is the logistic regression table, which contains the estimated
regression coefficients, estimated approximate standard errors, the Wald test statistics and
P-values, the estimated odds ratios for the two estimated linear predictors, and the 95 percent
confidence intervals for the odds ratios. The maximum likelihood estimates of $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are:

$$\mathbf{b}_1 = \begin{bmatrix} 3.958 \\ -0.0464 \\ 2.9135 \\ 1.8875 \\ 1.0670 \\ 2.2305 \end{bmatrix} \qquad \mathbf{b}_2 = \begin{bmatrix} 5.475 \\ -0.0654 \\ 2.9570 \\ 2.0597 \\ 2.0429 \\ 2.4524 \end{bmatrix}$$

Before using the fitted model to make inferences, various regression diagnostics similar to
those already discussed for binary logistic regression should be examined. In polytomous
logistic regression, the multiple outcome categories make this a more difficult problem



FIGURE 14.18 MINITAB Nominal Logistic Regression Output: Pregnancy Duration Example.

Polytomous Nominal MTB Output

Response Information

Variable   Value   Count
preterm    3          41   (Reference Event)
           2          35
           1          26
           Total     102

Logistic Regression Table

                                                     Odds       95% CI
Predictor        Coef   SE Coef       Z       P     Ratio    Lower    Upper
Logit 1: (2/3)
Constant        3.958     1.941    2.04   0.041
nutritio     -0.04645   0.01489   -3.12   0.002      0.95     0.93     0.98
agecat1        2.9135    0.8575    3.40   0.001     18.42     3.43    98.91
agecat3        1.8875    0.8088    2.33   0.020      6.60     1.35    32.23
alcohol        1.0670    0.6495    1.64   0.100      2.91     0.81    10.38
smoking        2.2305    0.6682    3.34   0.001      9.30     2.51    34.47
Logit 2: (1/3)
Constant        5.475     2.272    2.41   0.016
nutritio     -0.06542   0.01824   -3.59   0.000      0.94     0.90     0.97
agecat1        2.9570    0.9645    3.07   0.002     19.24     2.91   127.41
agecat3        2.0597    0.8947    2.30   0.021      7.84     1.36    45.30
alcohol        2.0429    0.7097    2.88   0.004      7.71     1.92    31.00
smoking        2.4524    0.7315    3.35   0.001     11.62     2.77    48.72

Log-likelihood = -84.338
Test that all slopes are zero: G = 52.011, DF = 10, P-Value = 0.000





than was the case for binary logistic regression. We thus recommend assessing the fit
and monitoring logistic regression diagnostics using the $J - 1$ individual binary logistic
regressions, as described in the first paragraph of the discussion of maximum likelihood
estimation. Hence, we would assess the fit
of the two logistic regression models separately, and then make a statement about the fit of
the polytomous logistic model descriptively. Diagnostics, including the Hosmer-Lemeshow
test for goodness of fit, simulated envelopes for deviance residuals, and plots of influence
statistics were examined for the pregnancy duration data, and no serious departures were
found (results not shown). We turn now to model interpretation and inference.

As indicated in Figure 14.18, all Wald test P-values are less than .05, with the exception
of alcohol in the first linear predictor, indicating that all of the predictors should be retained.
In all cases, the direction of the association between the predictors and the estimated logits,
as indicated by the signs of the estimated regression coefficients, was as expected.

For teenagers, the estimated odds of delivering preterm compared to full term are
18.42 times the estimated odds for women 21-30 years of age; the 95% confidence interval
for this odds ratio has a lower limit of 3.43 and an upper limit of 98.91. Thus while
the age effect is estimated to be very large, there is considerable uncertainty in the estimate.
Similarly, the estimated odds for teenagers of delivering intermediate term compared to
full term are 19.24 times the estimated odds for women 21-30 years of age; the lower 95%
confidence limit is 2.91 and the upper limit is 127.41.
History of smoking, history of alcohol use, and being in the 30-and-over age category also
increase the estimated odds of delivering preterm or intermediate term compared to full
term, though less dramatically. The negative estimated coefficients for nutritional status indicate
that a lower nutritional status is associated with increased odds of delivering preterm
or intermediate term compared to full term.
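
The odds ratios and confidence limits reported in Figure 14.18 can be reproduced from an estimated coefficient and its standard error. The sketch below assumes the usual large-sample interval $\exp(b \pm z \cdot s\{b\})$ with $z = 1.96$; the function name is illustrative.

```python
import math


def odds_ratio_with_ci(coef, se, z=1.96):
    """Estimated odds ratio exp(coef) with an approximate confidence
    interval exp(coef - z * se), exp(coef + z * se)."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))


# agecat1 coefficient and standard error from the first logit in Figure 14.18.
odds_ratio, lower, upper = odds_ratio_with_ci(2.9135, 0.8575)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

The width of the interval on the odds-ratio scale reflects the exponentiation of a fairly large standard error, which is why the age effect is estimated so imprecisely.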

Comment 

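To derive expression (14.101) for the log likelihood, we first obtain the logarithm of (14.100) and
let $\pi_{iJ} = 1 - \sum_{j=1}^{J-1}\pi_{ij}$ and $Y_{iJ} = 1 - \sum_{j=1}^{J-1}Y_{ij}$, so that:

$$\log_e P(Y_1, \ldots, Y_n) = \sum_{i=1}^{n}\left[\sum_{j=1}^{J-1}Y_{ij}\log_e \pi_{ij} + \left(1 - \sum_{j=1}^{J-1}Y_{ij}\right)\log_e \pi_{iJ}\right]$$

$$= \sum_{i=1}^{n}\left[\sum_{j=1}^{J-1}Y_{ij}\log_e\left(\frac{\pi_{ij}}{\pi_{iJ}}\right) + \log_e \pi_{iJ}\right]$$

Substitution of the expressions in (14.97b) for $\log_e(\pi_{ij}/\pi_{iJ})$ and in (14.99) for $\pi_{iJ}$ in the second term
leads to the desired log likelihood in (14.101). ■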


14.12 Polytomous Logistic Regression for Ordinal Response


Up to this point, we have considered polytomous logistic regression models for unordered 
categories. Categories, however, are frequently ordered. Consider the following response 
variables: 


1. A food product is rated by consumers on a 1-10 hedonic scale. 





2. In an economic study, persons are classified as either not employed, employed part time, 
or employed full time. 

3. The quality of sheet metal produced is rated on a 1-5 scale, depending on the clarity and 
reflectivity of the surface. 

4. Employees are asked to rate working conditions using a 7-point scale (unacceptable, 
poor, fair, acceptable, good, excellent, outstanding). 

5. The severity of cancer is rated by stages on a 1-4 basis.

Such responses can be analyzed by using the techniques for nominal logistic regression 
described in Section 14.11, but a more effective strategy, yielding a more parsimonious and 
more easily interpreted model, results if the ordering of the categories is taken into account 
explicitly. The model that is usually employed is called the proportional odds model. 

To motivate this model, we revisit the pregnancy duration example. We will assume that
pregnancy duration is a continuous response denoted by $Y_i^c$. For ease of exposition, we will
also assume that there is just one (quantitative) predictor, nutrition index, $X_{i1}$. Assume that
$Y_i^c$ can be represented by the simple linear regression model:

$$Y_i^c = \beta_0^* + \beta_1^* X_{i1} + k\varepsilon_L$$

where $\varepsilon_L$ follows the standard logistic distribution (14.14) with mean zero and standard
deviation $\pi/\sqrt{3}$, and $k$ is a constant that satisfies:

$$\sigma\{Y_i^c\} = k\sigma\{\varepsilon_L\} = k\frac{\pi}{\sqrt{3}}$$

Researchers were interested in specific categories of pregnancy delivery time and therefore
discretized pregnancy duration $Y_i^c$ using the following upper bounds or cutpoints for each
category:


$Y_i$   Category            $Y_i^c$                              Cutpoint $T_j$
  1     Preterm             $0 < Y_i^c \le 36$ weeks             $T_1 = 36$ weeks
  2     Intermediate term   36 weeks $< Y_i^c \le$ 38 weeks      $T_2 = 38$ weeks
  3     Full term           38 weeks $< Y_i^c < \infty$          $T_3 = \infty$


The proportional odds model for ordinal logistic regression models the cumulative probabilities
$P(Y_i \le j)$ rather than the specific category probabilities $P(Y_i = j)$ as was the case
for nominal logistic regression. We now develop the required expressions for the cumulative
probabilities.

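For $j = 1$ we have:

$$P(Y_i \le 1) = P(Y_i^c \le T_1) \qquad (14.103a)$$

$$= P(\beta_0^* + \beta_1^* X_{i1} + k\varepsilon_L \le T_1) \qquad (14.103b)$$

$$= P(k\varepsilon_L \le T_1 - \beta_0^* - \beta_1^* X_{i1}) \qquad (14.103c)$$

$$= P\left(\varepsilon_L \le \frac{T_1 - \beta_0^*}{k} - \frac{\beta_1^*}{k}X_{i1}\right) \qquad (14.103d)$$

$$= P(\varepsilon_L \le \alpha_1 + \beta_1 X_{i1}) \qquad (14.103e)$$

where $\alpha_1 = (T_1 - \beta_0^*)/k$ and $\beta_1 = -\beta_1^*/k$. Since $\varepsilon_L$ follows a standard logistic distribution,
the cumulative probability in (14.103e) is obtained by using the cumulative distribution
function (14.14b):

$$P(Y_i \le 1) = \pi_{i1} = \frac{\exp(\alpha_1 + \beta_1 X_{i1})}{1 + \exp(\alpha_1 + \beta_1 X_{i1})} \qquad (14.103f)$$

For $j = 2$, following the development in (14.103), we have:

$$P(Y_i \le 2) = P(Y_i^c \le T_2) \qquad (14.104a)$$

$$= P(\beta_0^* + \beta_1^* X_{i1} + k\varepsilon_L \le T_2) \qquad (14.104b)$$

$$= P(k\varepsilon_L \le T_2 - \beta_0^* - \beta_1^* X_{i1}) \qquad (14.104c)$$

$$= P\left(\varepsilon_L \le \frac{T_2 - \beta_0^*}{k} - \frac{\beta_1^*}{k}X_{i1}\right) \qquad (14.104d)$$

$$= P(\varepsilon_L \le \alpha_2 + \beta_1 X_{i1}) \qquad (14.104e)$$

$$= \frac{\exp(\alpha_2 + \beta_1 X_{i1})}{1 + \exp(\alpha_2 + \beta_1 X_{i1})} \qquad (14.104f)$$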

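Notice that the only difference between (14.103f) and (14.104f) involves the intercept
terms $\alpha_1$ and $\alpha_2$. The slopes $\beta_1$ are the same in both expressions. For the multiple regression
case involving $J$ ordered categories, we let:

$$\mathbf{X}_i = \begin{bmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{i,p-1} \end{bmatrix} \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{bmatrix}$$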

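Equations (14.103f) and (14.104f) become for category $j$:

$$P(Y_i \le j) = \frac{\exp(\alpha_j + \mathbf{X}_i'\boldsymbol{\beta})}{1 + \exp(\alpha_j + \mathbf{X}_i'\boldsymbol{\beta})} \qquad \text{for } j = 1, 2, \ldots, J - 1 \qquad (14.105)$$

Model (14.105) is often referred to as the proportional odds model. Taking the logit transformation
of both sides yields the $J - 1$ cumulative logits:

$$\log_e\left[\frac{P(Y_i \le j)}{1 - P(Y_i \le j)}\right] = \alpha_j + \mathbf{X}_i'\boldsymbol{\beta} \qquad \text{for } j = 1, \ldots, J - 1 \qquad (14.106)$$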


The difference between the ordinal logits in (14.106) and the nominal logits in (14.97b)
should now be clear. In the nominal case, each of the $J - 1$ parameter vectors $\boldsymbol{\beta}_j$ is unique.
For ordinal responses, the slope coefficient vectors $\boldsymbol{\beta}$ are identical for each of the $J - 1$
cumulative logits, but the intercepts differ.
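
A short sketch may help show how the cumulative probabilities in (14.105) translate into category probabilities. The intercepts and slopes below are the estimates reported later in Figure 14.19 (predictor order: nutritio, agecat1, agecat3, smoking, alcohol); the predictor values describe a hypothetical woman and, like the function names, are assumptions for illustration only.

```python
import math


def cumulative_probabilities(x, alphas, beta):
    """P(Y <= j) for j = 1, ..., J - 1 under the proportional odds model:
    one intercept per cutpoint, a single common slope vector."""
    eta = sum(b * v for b, v in zip(beta, x))
    return [1.0 / (1.0 + math.exp(-(a + eta))) for a in alphas]


def ordinal_category_probabilities(x, alphas, beta):
    """P(Y = j) for j = 1, ..., J, obtained by differencing the cumulative
    probabilities, with P(Y <= 0) = 0 and P(Y <= J) = 1."""
    cum = [0.0] + cumulative_probabilities(x, alphas, beta) + [1.0]
    return [hi - lo for lo, hi in zip(cum, cum[1:])]


alphas = [2.930, 5.025]
beta = [-0.04887, 1.9760, 1.3635, 1.5915, 1.6699]

# Hypothetical case: nutrition index 120, age 21 to 30, nonsmoker, no alcohol use.
x = [120.0, 0.0, 0.0, 0.0, 0.0]

cum = cumulative_probabilities(x, alphas, beta)
probs = ordinal_category_probabilities(x, alphas, beta)
print([round(c, 4) for c in cum])
print([round(p, 4) for p in probs])
```

Because the slope vector is shared, changing a predictor shifts both cumulative logits by the same amount; only the intercepts separate the categories.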

As in the binary logistic regression case, each slope parameter can again be interpreted
as the change in the logarithm of an odds ratio (this time the cumulative odds ratio) for a
unit change in its associated predictor. In general, (14.106) satisfies, for $j = 1, \ldots, J - 1$:

$$\log_e\left[\frac{P(Y_k \le j)}{P(Y_k > j)} \div \frac{P(Y_l \le j)}{P(Y_l > j)}\right] = (\mathbf{X}_k - \mathbf{X}_l)'\boldsymbol{\beta} \qquad (14.107)$$







We now briefly discuss estimation methods before returning to the pregnancy duration 
example. 


Maximum Likelihood Estimation. As was the case for nominal logistic regression,
separate binary logistic regressions can be used to obtain estimates of the $J - 1$ linear
predictors in (14.106). For $j = 1, \ldots, J - 1$, we construct the binary outcome variable:

$$Y_i^{(j)} = \begin{cases} 1 & \text{if } Y_i \le j \\ 0 & \text{if } Y_i > j \end{cases}$$

and carry out a logistic regression analysis based on $Y_i^{(j)}$. Note that this approach leads to
$J - 1$ separate estimates of the slope parameter vector $\boldsymbol{\beta}$.

A better approach, if the required software is available, is to estimate $\alpha_1, \ldots, \alpha_{J-1}$ and
$\boldsymbol{\beta}$ simultaneously using maximum likelihood estimation. From (14.100), the likelihood is
given by:

$$P(Y_1, \ldots, Y_n) = \prod_{i=1}^{n}\left[\prod_{j=1}^{J}[\pi_{ij}]^{Y_{ij}}\right] = \prod_{i=1}^{n}\left[\prod_{j=1}^{J}[P(Y_i \le j) - P(Y_i \le j - 1)]^{Y_{ij}}\right] \qquad (14.108)$$

Substitution of $P(Y_i \le J) = 1$, $P(Y_i \le 0) = 0$, and the expression for $P(Y_i \le j)$,
$j = 1, \ldots, J - 1$, in (14.105) yields the required expression for the likelihood in terms of
$\alpha_1, \ldots, \alpha_{J-1}$, and $\boldsymbol{\beta}$. The maximum likelihood estimates are those values of $\alpha_1, \ldots, \alpha_{J-1}$
and $\boldsymbol{\beta}$, namely, $a_1, \ldots, a_{J-1}$ and $\mathbf{b}$, that maximize (14.108). As always, we shall rely on
standard statistical software to carry out the maximization. We now return to the pregnancy
duration example.
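
The logarithm of the likelihood in (14.108) is simple to evaluate once the cumulative probabilities are available. The sketch below is a hand-checkable illustration on hypothetical data, not a fitting routine; the function name and the toy values are assumptions.

```python
import math


def ordinal_log_likelihood(X, y, alphas, beta):
    """Logarithm of (14.108): for each case, the log of
    P(Y_i <= y_i) - P(Y_i <= y_i - 1), with categories numbered 1, ..., J."""
    total = 0.0
    for x, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x))
        # cum[j] = P(Y_i <= j) for j = 0, ..., J
        cum = [0.0] + [1.0 / (1.0 + math.exp(-(a + eta))) for a in alphas] + [1.0]
        total += math.log(cum[yi] - cum[yi - 1])
    return total


# Hypothetical check: with a zero slope and these intercepts, the category
# probabilities are .50, .25, and .25 for every case.
X = [[0.0], [0.0], [0.0]]
y = [1, 2, 3]
alphas = [0.0, math.log(3.0)]
beta = [0.0]

ll = ordinal_log_likelihood(X, y, alphas, beta)
print(round(ll, 6))
```

A numerical optimizer applied to the negative of this function would, under these assumptions, give estimates of the intercepts and the common slope vector simultaneously.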


We continue the analysis of the pregnancy duration data, this time under the assumption
that the response is ordinal, rather than nominal. Recall that $Y_i = 1$ indicates preterm delivery,
$Y_i = 2$ indicates intermediate-term delivery, and $Y_i = 3$ indicates full-term delivery.
MINITAB ordinal logistic regression output is shown in Figure 14.19. As required with
$J = 3$, the program provides estimates for two intercepts, $a_1 = 2.930$ and $a_2 = 5.025$, and
$p - 1 = 5$ slope coefficients, $b_1 = -.04887$, $b_2 = 1.9760$, $b_3 = 1.3635$, $b_4 = 1.5915$, and
$b_5 = 1.6699$. The Wald P-values indicate that all of the regression coefficients are statistically
significant at the .05 level.

As noted above, each coefficient can be interpreted as the change in the logarithm of the
cumulative odds for a unit change in the predictor. For example, the results indicate that the
logarithm of the odds of a pre- or intermediate-term delivery ($Y_i \le 2$) for smokers ($X_5 = 1$)
is estimated to exceed the logarithm of the odds for nonsmokers ($X_5 = 0$) by $b_4 = 1.5915$.
The estimated cumulative odds ratio is given by exp(1.5915) = 4.91 and a 95% confidence interval for
the true cumulative odds ratio has a lower limit of 2.02 and an upper limit of 11.92. The
remaining slope parameters can be interpreted in a similar fashion.
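
The proportional odds property behind this interpretation can be verified numerically: the ratio of cumulative odds for smokers to nonsmokers is the same at both cutpoints and equals the exponential of the smoking coefficient. The sketch uses the estimates from Figure 14.19; the nutrition index of 120 is a hypothetical value, and the function name is an assumption.

```python
import math


def cumulative_odds(alpha, eta):
    """Odds P(Y <= j) / P(Y > j) when the cumulative logit is alpha + eta."""
    p = 1.0 / (1.0 + math.exp(-(alpha + eta)))
    return p / (1.0 - p)


a1, a2 = 2.930, 5.025
b_nutrition, b_smoking = -0.04887, 1.5915

# Hypothetical case: nutrition index 120, all indicator predictors equal to 0.
eta_nonsmoker = b_nutrition * 120.0
eta_smoker = eta_nonsmoker + b_smoking

ratios = [cumulative_odds(a, eta_smoker) / cumulative_odds(a, eta_nonsmoker)
          for a in (a1, a2)]
print([round(r, 2) for r in ratios], round(math.exp(b_smoking), 2))
```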

Notice again that the interpretation of the ordinal logistic regression model is much 
simpler than that for the nominal logistic regression model, because only a single slope 
vector p is estimated. 





FIGURE 14.19 MINITAB Ordinal Logistic Regression Output: Pregnancy Duration Example.

Link Function: Logit

Response Information

Variable   Value   Count
preterm    1          26
           2          35
           3          41
           Total     102

Logistic Regression Table

                                                     Odds       95% CI
Predictor        Coef   SE Coef       Z       P     Ratio    Lower    Upper
Const(1)        2.930     1.465    2.00   0.045
Const(2)        5.025     1.521    3.30   0.001
nutritio     -0.04887   0.01168   -4.18   0.000      0.95     0.93     0.97
agecat1        1.9760    0.5875    3.36   0.001      7.21     2.28    22.82
agecat3        1.3635    0.5547    2.46   0.014      3.91     1.32    11.60
smoking        1.5915    0.4525    3.52   0.000      4.91     2.02    11.92
alcohol        1.6699    0.4727    3.53   0.000      5.31     2.10    13.42

Log-likelihood = -86.756
Test that all slopes are zero: G = 47.174, DF = 5, P-Value = 0.000


Comment 

Our development of the proportional odds model assumed that the ordinal response $Y_i$ was obtained
from an explicit discretization of an observed continuous response $Y_i^c$, but this is not required. This
model often works well for ordinal responses that do not arise from such a discretization. ■


14.13 Poisson Regression

We consider now another nonlinear regression model where the response outcomes are
discrete. Poisson regression is useful when the outcome is a count, with large-count outcomes
being rare events. For instance, the number of times a household shops at a particular
supermarket in a week is a count, with a large number of shopping trips to the store during
the week being a rare event. A researcher may wish to study the relation between a family's
number of shopping trips to the store during a particular week and the family's income,
number of children, distance from the store, and some other explanatory variables. As another
example, the relation between the number of hospitalizations of a member of a health
maintenance organization during the past year and the member's age, income, and previous
health status may be of interest.


Poisson Distribution 

The Poisson distribution can be utilized for outcomes that are counts ($Y_i = 0, 1, 2, \ldots$),
with a large count or frequency being a rare event. The Poisson probability distribution is
as follows:


Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models 619


$$f(Y) = \frac{\mu^{Y} \exp(-\mu)}{Y!} \qquad Y = 0, 1, 2, \ldots \tag{14.109}$$

where $f(Y)$ denotes the probability that the outcome is $Y$ and $Y! = Y(Y - 1) \cdots 3 \cdot 2 \cdot 1$.
The mean and variance of the Poisson probability distribution are:

$$E\{Y\} = \mu \tag{14.110a}$$

$$\sigma^2\{Y\} = \mu \tag{14.110b}$$

Note that the variance is the same as the mean. Hence, if the number of store trips follows 
the Poisson distribution and the mean number of store trips for a family with three children 
is larger than the mean number of trips for a family with no children, the variances of the 
distributions of outcomes for the two families will also differ. 
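
The equality of mean and variance can be checked numerically. The following sketch evaluates the Poisson probabilities (14.109) on the log scale and sums them to approximate the mean and variance. The two mean values (2.0 and 3.5 store trips) are hypothetical and chosen only for illustration, and the function names are not from the text.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability f(Y) in (14.109), evaluated on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


def mean_and_variance(mu, upper=100):
    """Approximate E{Y} and the variance by summing over Y = 0, ..., upper."""
    probs = [poisson_pmf(y, mu) for y in range(upper + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, variance


# Hypothetical mean numbers of weekly store trips (illustrative values only).
for label, mu in [("no children", 2.0), ("three children", 3.5)]:
    mean, variance = mean_and_variance(mu)
    print(f"{label}: mean = {mean:.4f}, variance = {variance:.4f}")
```

For each family the computed variance agrees with the mean, so the family with the larger mean also has the larger variance.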

Comment

At times, the count responses $Y$ will pertain to different units of time or space. For instance, in a
survey intended to obtain the total number of store trips during a particular month, some of the counts
pertained only to the last week of the month. In such cases, let $\mu$ denote the mean response for $Y$
for a unit of time or space (e.g., one month), and let $t$ denote the number of units of time or space to
which $Y$ corresponds. For instance, $t = 7/30$ if $Y$ is the number of store trips during one week where
the unit time is one month; $t = 1$ if $Y$ is the number of store trips during the month. The Poisson
probability distribution is then expressed as follows:

$$f(Y) = \frac{(t\mu)^{Y} \exp(-t\mu)}{Y!} \qquad Y = 0, 1, 2, \ldots \tag{14.111}$$

Our discussion throughout this section assumes that all responses $Y_i$ pertain to the same unit of time
or space.

Poisson Regression Model 

The Poisson regression model, like any nonlinear regression model, can be stated as follows:

$$Y_i = E\{Y_i\} + \varepsilon_i \qquad i = 1, 2, \ldots, n$$

The mean response for the $i$th case, to be denoted now by $\mu_i$ for simplicity, is assumed
as always to be a function of the set of predictor variables, $X_1, \ldots, X_{p-1}$. We use the
notation $\mu(\mathbf{X}_i, \boldsymbol{\beta})$ to denote the function that relates the mean response $\mu_i$ to $\mathbf{X}_i$, the values
of the predictor variables for case $i$, and $\boldsymbol{\beta}$, the values of the regression coefficients. Some
commonly used functions for Poisson regression are:

$$\mu_i = \mu(\mathbf{X}_i, \boldsymbol{\beta}) = \mathbf{X}_i'\boldsymbol{\beta} \tag{14.112a}$$

$$\mu_i = \mu(\mathbf{X}_i, \boldsymbol{\beta}) = \exp(\mathbf{X}_i'\boldsymbol{\beta}) \tag{14.112b}$$

$$\mu_i = \mu(\mathbf{X}_i, \boldsymbol{\beta}) = \log_e(\mathbf{X}_i'\boldsymbol{\beta}) \tag{14.112c}$$

In all three cases, the mean responses $\mu_i$ must be nonnegative.

Since the distribution of the error terms $\varepsilon_i$ for Poisson regression is a function of the
distribution of the response $Y_i$, which is Poisson, it is easiest to state the Poisson regression
model in the following form:

$Y_i$ are independent Poisson random variables with expected
values $\mu_i$, where:

$$\mu_i = \mu(\mathbf{X}_i, \boldsymbol{\beta}) \tag{14.113}$$

The most commonly used response function is $\mu_i = \exp(\mathbf{X}_i'\boldsymbol{\beta})$.


Maximum Likelihood Estimation 

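For Poisson regression model (14.113), the likelihood function is as follows:

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} f_i(Y_i) = \prod_{i=1}^{n} \frac{[\mu(\mathbf{X}_i, \boldsymbol{\beta})]^{Y_i} \exp[-\mu(\mathbf{X}_i, \boldsymbol{\beta})]}{Y_i!} = \frac{\left\{\prod_{i=1}^{n} [\mu(\mathbf{X}_i, \boldsymbol{\beta})]^{Y_i}\right\} \exp\left[-\sum_{i=1}^{n} \mu(\mathbf{X}_i, \boldsymbol{\beta})\right]}{\prod_{i=1}^{n} Y_i!} \tag{14.114}$$

Once the functional form of $\mu(\mathbf{X}_i, \boldsymbol{\beta})$ is chosen, the maximization of (14.114) produces
the maximum likelihood estimates of the regression coefficients $\boldsymbol{\beta}$. As before, it is easier to
work with the logarithm of the likelihood function:

$$\log_e L(\boldsymbol{\beta}) = \sum_{i=1}^{n} Y_i \log_e[\mu(\mathbf{X}_i, \boldsymbol{\beta})] - \sum_{i=1}^{n} \mu(\mathbf{X}_i, \boldsymbol{\beta}) - \sum_{i=1}^{n} \log_e(Y_i!) \tag{14.115}$$

Numerical search procedures are used to find the maximum likelihood estimates $b_0, b_1, \ldots, b_{p-1}$.
Iteratively reweighted least squares can again be used to obtain these estimates. We
shall rely on standard statistical software packages specifically designed to handle Poisson
regression to obtain the maximum likelihood estimates.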
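
To make the numerical search concrete, here is a minimal sketch that maximizes the log-likelihood (14.115) for the response function $\mu_i = \exp(b_0 + b_1 X_i)$ with a single predictor. It uses Newton-Raphson steps, which for this response function give the same iterations as iteratively reweighted least squares. The data are hypothetical toy values, not from the text, and the function name `fit_poisson_log_link` is an assumption made for illustration; in practice a statistical package would be used.

```python
import math

# Hypothetical toy data (not from the text): one predictor x and counts y.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 2, 2, 5, 7, 12]


def fit_poisson_log_link(x, y, tol=1e-10, max_iter=50):
    """Maximize (14.115) for mu_i = exp(b0 + b1 * x_i) by Newton-Raphson."""
    b0 = math.log(sum(y) / len(y))  # start at the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: first derivatives of the log-likelihood.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix entries (negative second derivatives).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2 x 2 system (information) * delta = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


b0, b1 = fit_poisson_log_link(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print("fitted values:", [round(m, 2) for m in fitted])
```

At convergence the score equations hold, so the fitted values sum to the observed total count.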

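After the maximum likelihood estimates have been found, we can obtain the fitted
response function and the fitted values:

$$\hat{\mu} = \mu(\mathbf{X}, \mathbf{b}) \tag{14.116a}$$

$$\hat{\mu}_i = \mu(\mathbf{X}_i, \mathbf{b}) \tag{14.116b}$$

For the three functions in (14.112), the fitted response functions and fitted values
are:

$\mu = \mathbf{X}'\boldsymbol{\beta}$:            $\hat{\mu} = \mathbf{X}'\mathbf{b}$            $\hat{\mu}_i = \mathbf{X}_i'\mathbf{b}$            (14.116c)
$\mu = \exp(\mathbf{X}'\boldsymbol{\beta})$:     $\hat{\mu} = \exp(\mathbf{X}'\mathbf{b})$     $\hat{\mu}_i = \exp(\mathbf{X}_i'\mathbf{b})$     (14.116d)
$\mu = \log_e(\mathbf{X}'\boldsymbol{\beta})$:   $\hat{\mu} = \log_e(\mathbf{X}'\mathbf{b})$   $\hat{\mu}_i = \log_e(\mathbf{X}_i'\mathbf{b})$   (14.116e)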


Model Development 

Model development for a Poisson regression model is carried out in a similar fashion
to that for logistic regression, conducting tests for individual coefficients or groups of
coefficients based on the likelihood ratio test statistic $G^2$ in (14.60). For Poisson regression
model (14.113), the model deviance is as follows:

$$DEV(X_0, X_1, \ldots, X_{p-1}) = -2\left[\sum_{i=1}^{n} Y_i \log_e\left(\frac{\hat{\mu}_i}{Y_i}\right) + \sum_{i=1}^{n} (Y_i - \hat{\mu}_i)\right] \tag{14.117}$$

where $\hat{\mu}_i$ is the fitted value for the $i$th case according to (14.116b). The deviance residual
for the $i$th case is:

$$dev_i = \pm\left[-2Y_i \log_e\left(\frac{\hat{\mu}_i}{Y_i}\right) - 2(Y_i - \hat{\mu}_i)\right]^{1/2} \tag{14.118}$$

The sign of the deviance residual is selected according to whether $Y_i - \hat{\mu}_i$ is positive or negative.
Index plots of the deviance residuals and half-normal probability plots with simulated
envelopes are useful for identifying outliers and checking the model fit.

Comment

If $Y_i = 0$, the term $[Y_i \log_e(\hat{\mu}_i / Y_i)]$ in (14.117) and (14.118) equals 0.

Inferences

Inferences for a Poisson regression model are carried out in the same way as for logistic
regression. For instance, there is often interest in estimating the mean response for predictor
variables $\mathbf{X}_h$. This estimate is obtained by substituting $\mathbf{X}_h$ into (14.116).

In Poisson regression analysis, there is sometimes also interest in estimating probabilities
of certain outcomes for given levels of the predictor variables, for instance, $P(Y = 0 \mid \mathbf{X}_h)$.
Such an estimated probability can be obtained readily by substituting $\hat{\mu}_h$ into (14.109).

Interval estimation of individual regression coefficients can be carried out by use of the
large-sample estimated standard deviations furnished by regression programs with Poisson
regression capabilities.

Example

The Miller Lumber Company is a large retailer of lumber and paint, as well as of plumbing, 
electrical, and other household supplies. During a representative two week period, in-store 
surveys were conducted and addresses of customers were obtained. The addresses were 
then used to identify the metropolitan area census tracts in which the customers reside. At 
the end of the survey period, the total number of customers who visited the store from each 
census tract within a 10-mile radius was determined and relevant demographic information 
for each tract (average income, number of housing units, etc.) was obtained. Several other 
variables expected to be related to customer counts were constructed from maps, including 
distance from census tract to nearest competitor and distance to store. 

Initial screening of the potential predictor variables was conducted, which led to the
retention of five predictor variables:

$X_1$: Number of housing units
$X_2$: Average income, in dollars
$X_3$: Average housing unit age, in years
$X_4$: Distance to nearest competitor, in miles
$X_5$: Distance to store, in miles

$Y_i$: Number of customers who visited store from census tract



TABLE 14.14 Data: Miller Lumber Company Example.

Census   Housing   Average   Average   Competitor   Store      Number of
Tract    Units     Income    Age       Distance     Distance   Customers
i        X1        X2        X3        X4           X5         Y
1        606       41,393     3        3.04         6.32        9
2        641       23,635    18        1.95         8.89        6
3        505       55,475    27        6.54         2.05       28
...      ...       ...       ...       ...          ...        ...
108      817       54,429    47                     9.90        6
109      268       34,022    54                     9.51        4
110      519       52,850    43                     8.62        6

TABLE 14.15 Fitted Poisson Response Function and Related Results: Miller Lumber Company Example.

(a) Fitted Poisson Response Function

$\hat{\mu} = \exp[2.942 + .000606X_1 - .0000117X_2 - .00373X_3 + .168X_4 - .129X_5]$

$DEV(X_0, X_1, X_2, X_3, X_4, X_5) = 114.985$

(b) Estimated Coefficients, Standard Deviations, and $G^2$ Test Statistics

               Estimated     Estimated
Regression     Regression    Standard
Coefficient    Coefficient   Deviation    G^2      P-value
beta_0          2.9424       .207
beta_1          .0006058     .00014       18.21    .000
beta_2         -.00001169    .0000021     31.80    .000
beta_3         -.003726      .0018         4.38    .036
beta_4          .1684        .026         41.66    .000
beta_5         -.1288        .016         67.50    .000


Data for a portion of the $n = 110$ census tracts are shown in Table 14.14.

Poisson regression model (14.113) with response function:

$$\mu(\mathbf{X}, \boldsymbol{\beta}) = \exp(\mathbf{X}'\boldsymbol{\beta})$$

was fitted to the data, using LISP-STAT (Reference 14.10). Some principal results are
presented in Table 14.15. Note that the deviance for this model is 114.985.
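
As an illustration of how the fitted response function is used, the sketch below evaluates $\exp(\mathbf{X}'\mathbf{b})$ for census tract 1 with the coefficients of Table 14.15b and the predictor values of Table 14.14, and then substitutes the fitted mean into (14.109) to estimate the probability of no customers. The helper names `fitted_mean` and `poisson_prob` are assumptions made for this example.

```python
import math

# Estimated coefficients b0, b1, ..., b5 from Table 14.15b.
b = [2.9424, 0.0006058, -0.00001169, -0.003726, 0.1684, -0.1288]


def fitted_mean(x, coefs=b):
    """Fitted value exp(X'b) for predictor values x = (X1, ..., X5)."""
    linear = coefs[0] + sum(bk * xk for bk, xk in zip(coefs[1:], x))
    return math.exp(linear)


def poisson_prob(y, mu):
    """Poisson probability (14.109) of outcome y when the mean is mu."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


# Census tract 1 in Table 14.14: X1, X2, X3, X4, X5.
tract_1 = (606, 41393, 3, 3.04, 6.32)
mu_1 = fitted_mean(tract_1)
print(f"fitted mean for tract 1: {mu_1:.1f}")
print(f"estimated P(Y = 0) for tract 1: {poisson_prob(0, mu_1):.2e}")
```

The fitted mean of about 12.3 customers agrees with the value reported for tract 1 in Table 14.16.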

Likelihood ratio test statistics (14.60) were calculated for each of the individual regression
coefficients. These $G^2$ test statistics are shown in Table 14.15b, together with their
associated P-values, each based on the chi-square distribution with one degree of freedom.
We note from the P-values that each predictor variable makes a marginal contribution to
the fit of the regression model and consequently should be retained in the model.
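
The P-values in Table 14.15b can be reproduced from the $G^2$ statistics. This sketch relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability can be written with the complementary error function; the function name is illustrative.

```python
import math


def chi_square_1df_p_value(g2):
    """Upper-tail probability P{chi-square(1) > g2}."""
    # chi-square(1) is Z**2, so P{Z**2 > g2} = P{|Z| > sqrt(g2)} = erfc(sqrt(g2 / 2)).
    return math.erfc(math.sqrt(g2 / 2.0))


# G^2 statistics for X1, ..., X5 from Table 14.15b.
g2_statistics = {"X1": 18.21, "X2": 31.80, "X3": 4.38, "X4": 41.66, "X5": 67.50}

for name, g2 in g2_statistics.items():
    print(f"{name}: G2 = {g2:6.2f}, P-value = {chi_square_1df_p_value(g2):.3f}")
```

Only $X_3$ (average housing unit age) has a P-value that does not round to .000.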

A portion of the deviance residuals $dev_i$ is shown in Table 14.16, together with the
responses $Y_i$ and the fitted values $\hat{\mu}_i$. Analysis of the deviance residuals did not disclose
any major problems. Figure 14.20 contains an index plot of the deviance residuals. We
note a few large negative deviance residuals; these are for census tracts where $Y_i = 0$; i.e.,
there were no customers from these areas. These may be difficult cases to fit with a Poisson
regression model.

FIGURE 14.20 Index Plot of Deviance Residuals: Miller Lumber Company Example. (Horizontal axis: Index, 20 to 100.)
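
The deviance residuals can be recomputed from the responses and fitted values. The sketch below applies (14.118), taking the logarithmic term as 0 when the response is 0, to the six cases listed in Table 14.16. Because the fitted values in that table are rounded to one decimal place, the results agree with the tabled residuals only approximately. The final line uses a hypothetical tract with no customers and a fitted mean of 3.0 to show why zero counts produce large negative residuals.

```python
import math


def deviance_residual(y, mu_hat):
    """Deviance residual (14.118) for response y and fitted value mu_hat."""
    # The term Y * log(mu_hat / Y) is taken to be 0 when Y = 0.
    log_term = y * math.log(mu_hat / y) if y > 0 else 0.0
    squared = -2.0 * log_term - 2.0 * (y - mu_hat)
    sign = 1.0 if y - mu_hat >= 0 else -1.0
    # Guard against tiny negative values caused by floating-point rounding.
    return sign * math.sqrt(max(squared, 0.0))


# (census tract, Y_i, fitted value) as listed in Table 14.16.
cases = [(1, 9, 12.3), (2, 6, 8.8), (3, 28, 28.1),
         (108, 6, 5.3), (109, 4, 4.4), (110, 6, 6.4)]

for tract, y, mu_hat in cases:
    print(f"tract {tract:3d}: dev = {deviance_residual(y, mu_hat):7.3f}")

# Hypothetical tract with no customers and a fitted mean of 3.0.
print(f"Y = 0, fitted 3.0: dev = {deviance_residual(0, 3.0):.3f}")
```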

14.14 Generalized Linear Models 


We conclude this chapter and the regression portion of this book by noting that all of the
regression models considered, linear and nonlinear, belong to a family of models called
generalized linear models. This family was first introduced by Nelder and Wedderburn
(Reference 14.11) and encompasses normal error linear regression models and the nonlinear
exponential, logistic, and Poisson regression models, as well as many other models, such
as log-linear models for categorical data.

The class of generalized linear models can be described as follows:

1. $Y_1, \ldots, Y_n$ are $n$ independent responses that follow a probability distribution belonging
to the exponential family of probability distributions, with expected value $E\{Y_i\} = \mu_i$.

2. A linear predictor based on the predictor variables $X_{i1}, \ldots, X_{i,p-1}$ is utilized, denoted
by $\mathbf{X}_i'\boldsymbol{\beta}$:

$$\mathbf{X}_i'\boldsymbol{\beta} = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}$$

3. The link function $g$ relates the linear predictor to the mean response:

$$\mathbf{X}_i'\boldsymbol{\beta} = g(\mu_i)$$

TABLE 14.16 Responses, Fitted Values, and Deviance Residuals: Miller Lumber Company Example.

Census Tract
i         Y_i    Fitted value    dev_i
1          9       12.3          -.999
2          6        8.8          -.992
3         28       28.1          -.024
...       ...      ...            ...
108        6        5.3           .289
109        4        4.4          -.197
110        6        6.4          -.171



Generalized linear models may have nonconstant variances $\sigma_i^2$ for the responses $Y_i$, but
the variance $\sigma_i^2$ must be a function of the predictor variables through the mean response $\mu_i$.
To illustrate the concept of the link function, consider first logistic regression model
(14.41). There, the logit transformation of $\pi_i$ in (14.18a) serves to link the linear predictor
$\mathbf{X}_i'\boldsymbol{\beta}$ to the mean response $\mu_i = \pi_i$:

$$g(\mu_i) = g(\pi_i) = \log_e\left(\frac{\pi_i}{1 - \pi_i}\right) = \mathbf{X}_i'\boldsymbol{\beta}$$

As a second example, consider Poisson regression model (14.113). There we considered
several response functions in (14.112). For the response function $\mu_i = \exp(\mathbf{X}_i'\boldsymbol{\beta})$ in
(14.112b), the linking relation is:

$$g(\mu_i) = \log_e(\mu_i) = \mathbf{X}_i'\boldsymbol{\beta}$$

We see from the Poisson regression models that there may be many different possible link
functions that can be employed. They need only be monotonic and differentiable.

Finally, we consider the normal error regression model in (6.7). There the link function
is simply:

$$g(\mu_i) = \mu_i$$

since the linking relation is:

$$\mathbf{X}_i'\boldsymbol{\beta} = \mu_i$$

The link function $g(\mu_i)$ for the normal error case is called the identity or unity link function.

Any regression model that belongs to the family of generalized linear models can be analyzed
in a unified fashion. The maximum likelihood estimates of the regression parameters
can be obtained by iteratively reweighted least squares [by ordinary least squares for normal
error linear regression models (6.7)]. Tests for model development to determine whether
some predictor variables may be dropped from the model can be conducted using likelihood
ratio tests. Reference 14.12 provides further details about generalized linear models and
their analysis.
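
The three link functions discussed in this section can be written side by side with their inverses, the mean response functions. This is only an illustrative sketch: the dictionary `LINKS` and the linear predictor value 0.4 are assumptions made for the example.

```python
import math

# Each entry pairs a link function g(mu) with its inverse, the mean response function.
LINKS = {
    "identity (normal error)": (lambda mu: mu, lambda eta: eta),
    "log (Poisson)": (math.log, math.exp),
    "logit (logistic)": (lambda mu: math.log(mu / (1.0 - mu)),
                         lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

eta = 0.4  # hypothetical value of the linear predictor X'beta

for name, (g, g_inverse) in LINKS.items():
    mu = g_inverse(eta)  # mean response implied by the linear predictor
    # Applying the link to the mean response recovers the linear predictor.
    print(f"{name}: mu = {mu:.4f}, g(mu) = {g(mu):.4f}")
```

Each link is monotonic and differentiable on its domain, so the mean response and the linear predictor determine each other uniquely.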


Cited References

14.1. Kennedy, W. J., Jr., and J. E. Gentle. Statistical Computing. New York: Marcel Dekker, 1980.

14.2. Agresti, A. Categorical Data Analysis. 2nd ed. New York: John Wiley & Sons, 2002.

14.3. LogXact 5. Cytel Software Corporation. Cambridge, Massachusetts, 2003.

14.4. Hosmer, D. W., and S. Lemeshow. Applied Logistic Regression. 2nd ed. New York: John
Wiley & Sons, 2000.

14.5. Cook, R. D., and S. Weisberg. Applied Regression Including Computing and Graphics. New
York: John Wiley & Sons, 1999.

14.6. Atkinson, A. C. "Two Graphical Displays for Outlying and Influential Observations in
Regression," Biometrika 68 (1981), pp. 13-20.

14.7. Johnson, R. A., and D. W. Wichern. Applied Multivariate Statistical Analysis. 5th ed.
Englewood Cliffs, N.J.: Prentice Hall, 2001.

14.8. Lachenbruch, P. A. Discriminant Analysis. New York: Hafner Press, 1975.

14.9. Begg, C. B., and R. Gray. "Calculation of Polytomous Logistic Regression Parameters Using
Individualized Regressions," Biometrika 71 (1984), pp. 11-18.

14.10. Tierney, L. LISP-STAT: An Object-Oriented Environment for Statistical Computing and
Dynamic Graphics. New York: John Wiley & Sons, 1990.

14.11. Nelder, J. A., and R. W. M. Wedderburn. "Generalized Linear Models," Journal of the Royal
Statistical Society A 135 (1972), pp. 370-84.

14.12. McCullagh, P., and J. A. Nelder. Generalized Linear Models. 2nd ed. London: Chapman and
Hall, 1999.




14.1. A student stated: “I fail to see why the response function needs to be constrained between 0 
and 1 when the response variable is binary and has a Bernoulli distribution. The fit to 0,1 data 
will take care of this problem for any response function.” Comment. 

14.2. Since the logit transformation (14.18) linearizes the logistic response function, why can't this
transformation be used on the individual responses $Y_i$ and a linear response function then
fitted? Explain.

14.3. If the true response function is J-shaped when the response variable is binary, would the use 
of the logistic response function be appropriate? Explain. 

14.4. a. Plot the logistic mean response function (14.16) when $\beta_0 = -25$ and $\beta_1 = .2$.

b. For what value of $X$ is the mean response equal to .5?

c. Find the odds when $X = 150$, when $X = 151$, and the ratio of the odds when $X = 151$ to
the odds when $X = 150$. Is this odds ratio equal to $\exp(\beta_1)$ as it should be?

*14.5. a. Plot the logistic mean response function (14.16) when $\beta_0 = 20$ and $\beta_1 = -.2$.

b. For what value of $X$ is the mean response equal to .5?

c. Find the odds when $X = 125$, when $X = 126$, and the ratio of the odds when $X = 126$ to
the odds when $X = 125$. Is the odds ratio equal to $\exp(\beta_1)$ as it should be?

14.6. a. Plot the probit mean response function (14.12) for $\beta_0^* = -25$ and $\beta_1^* = .2$. How does this
function compare to the logistic mean response function in part (a) of Problem 14.4?

b. For what value of $X$ is the mean response equal to .5?

*14.7. Annual dues. The board of directors of a professional association conducted a random sample
survey of 30 members to assess the effects of several possible amounts of dues increase. The
sample results follow. $X$ denotes the dollar increase in annual dues posited in the survey
interview, and $Y = 1$ if the interviewee indicated that the membership will not be renewed at
that amount of dues increase and 0 if the membership will be renewed.

i:      1    2    3   ...   28   29   30
X_i:   30   30   30   ...   49   50   50
Y_i:    0    1    0   ...    0    1    1

Logistic regression model (14.20) is assumed to be appropriate.

a. Find the maximum likelihood estimates of $\beta_0$ and $\beta_1$. State the fitted response function.

b. Obtain a scatter plot of the data with both the fitted logistic response function from part
(a) and a lowess smooth superimposed. Does the fitted logistic response function appear
to fit well?

c. Obtain $\exp(b_1)$ and interpret this number.

d. What is the estimated probability that association members will not renew their membership
if the dues are increased by $40?

e. Estimate the amount of dues increase for which 75 percent of the members are expected
not to renew their association membership.



14.8. Refer to Annual dues Problem 14.7.

a. Fit a probit mean response function (14.12) to the data. Qualitatively compare the fit here
with the logistic fit obtained in part (a) of Problem 14.7. What do you conclude?

b. Fit a complementary log-log mean response function (14.19) to the data. Qualitatively
compare the fit here with the logistic fit obtained in part (a) of Problem 14.7. What do you
conclude?


14.9. Performance ability. A psychologist conducted a study to examine the nature of the relation,
if any, between an employee's emotional stability ($X$) and the employee's ability to perform
in a task group ($Y$). Emotional stability was measured by a written test for which the higher
the score, the greater is the emotional stability. Ability to perform in a task group ($Y = 1$ if
able, $Y = 0$ if unable) was evaluated by the supervisor. The results for 27 employees were:

i:       1     2     3   ...    25    26    27
X_i:   474   432   453   ...   562   506   600
Y_i:     0     0     0   ...     1     0     1

Logistic regression model (14.20) is assumed to be appropriate.

a. Find the maximum likelihood estimates of $\beta_0$ and $\beta_1$. State the fitted response function.

b. Obtain a scatter plot of the data with both the fitted logistic response function from part (a)
and a lowess smooth superimposed. Does the fitted logistic response function appear to fit
well?

c. Obtain $\exp(b_1)$ and interpret this number.

d. What is the estimated probability that employees with an emotional stability test score of
550 will be able to perform in a task group?

e. Estimate the emotional stability test score for which 70 percent of the employees with this
test score are expected to be able to perform in a task group.


14.10. Refer to Performance ability Problem 14.9. 

a. Fit a probit mean response function (14.12) to the data. Qualitatively compare the fit here
with the logistic fit obtained in part (a) of Problem 14.9. What do you conclude?

b. Fit a complementary log-log mean response function (14.19) to the data. Qualitatively
compare the fit here with the logistic fit obtained in part (a) of Problem 14.9. What do you
conclude?

*14.11. Bottle return. A carefully controlled experiment was conducted to study the effect of the
size of the deposit level on the likelihood that a returnable one-liter soft-drink bottle will be
returned. A bottle return was scored 1, and no return was scored 0. The data to follow show
the number of bottles that were returned ($Y_j$) out of 500 sold ($n_j$) at each of six deposit levels
($X_j$, in cents):

j:                       1     2     3     4     5     6
Deposit level X_j:       2     5    10    20    25    30
Number sold n_j:       500   500   500   500   500   500
Number returned Y_j:    72   103   170   296   406   449

An analyst believes that logistic regression model (14.20) is appropriate for studying the
relation between size of deposit and the probability a bottle will be returned.

a. Plot the estimated proportions $p_j = Y_j/n_j$ against $X_j$. Does the plot support the analyst's
belief that the logistic response function is appropriate?

b. Find the maximum likelihood estimates of $\beta_0$ and $\beta_1$. State the fitted response function.

c. Obtain a scatter plot of the data with the estimated proportions from part (a), and superimpose
the fitted logistic response function from part (b). Does the fitted logistic response
function appear to fit well?

d. Obtain $\exp(b_1)$ and interpret this number.

e. What is the estimated probability that a bottle will be returned when the deposit is 15 cents?

f. Estimate the amount of deposit for which 75 percent of the bottles are expected to be
returned.

14.12. Toxicity experiment. In an experiment testing the effect of a toxic substance, 1,500 experimental
insects were divided at random into six groups of 250 each. The insects in each group
were exposed to a fixed dose of the toxic substance. A day later, each insect was observed.
Death from exposure was scored 1, and survival was scored 0. The results are shown below;
$X_j$ denotes the dose level (on a logarithmic scale) administered to the insects in group $j$ and
$Y_j$ denotes the number of insects that died out of the 250 ($n_j$) in the group.

j:       1     2     3     4     5     6
X_j:     1     2     3     4     5     6
n_j:   250   250   250   250   250   250
Y_j:    28    53    93   126   172   197

Logistic regression model (14.20) is assumed to be appropriate.

a. Plot the estimated proportions $p_j = Y_j/n_j$ against $X_j$. Does the plot support the analyst's
belief that the logistic response function is appropriate?

b. Find the maximum likelihood estimates of $\beta_0$ and $\beta_1$. State the fitted response function.

c. Obtain a scatter plot of the data with the estimated proportions from part (a), and superimpose
the fitted logistic response function from part (b). Does the fitted logistic response
function appear to fit well?

d. Obtain $\exp(b_1)$ and interpret this number.

e. What is the estimated probability that an insect dies when the dose level is $X = 3.5$?

f. What is the estimated median lethal dose, that is, the dose for which 50 percent of the
experimental insects are expected to die?

14.13. Car purchase. A marketing research firm was engaged by an automobile manufacturer to
conduct a pilot study to examine the feasibility of using logistic regression for ascertaining
the likelihood that a family will purchase a new car during the next year. A random sample of
33 suburban families was selected. Data on annual family income ($X_1$, in thousand dollars)
and the current age of the oldest family automobile ($X_2$, in years) were obtained. A follow-up
interview conducted 12 months later was used to determine whether the family actually
purchased a new car ($Y = 1$) or did not purchase a new car ($Y = 0$) during the year.

i:       1    2    3   ...   31   32   33
X_i1:   32   45   60   ...   21   32   17
X_i2:    3    2    2   ...    3    5    1
Y_i:     0    0    1   ...    0    1    0

Multiple logistic regression model (14.41) with two predictor variables in first-order terms is
assumed to be appropriate.

a. Find the maximum likelihood estimates of $\beta_0$, $\beta_1$, and $\beta_2$. State the fitted response function.

b. Obtain $\exp(b_1)$ and $\exp(b_2)$ and interpret these numbers.

c. What is the estimated probability that a family with annual income of $50 thousand and
an oldest car of 3 years will purchase a new car next year?





*14.14. Flu shots. A local health clinic sent fliers to its clients to encourage everyone, but especially
older persons at high risk of complications, to get a flu shot in time for protection against an
expected flu epidemic. In a pilot follow-up study, 159 clients were randomly selected and asked
whether they actually received a flu shot. A client who received a flu shot was coded $Y = 1$,
and a client who did not receive a flu shot was coded $Y = 0$. In addition, data were collected
on their age ($X_1$) and their health awareness. The latter data were combined into a health
awareness index ($X_2$), for which higher values indicate greater awareness. Also included in
the data was client gender, where males were coded $X_3 = 1$ and females were coded $X_3 = 0$.

i:   1   2   3   ...   157   158   159

Multiple logistic regression model (14.41) with three predictor variables in first-order terms
is assumed to be appropriate.

a. Find the maximum likelihood estimates of $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$. State the fitted response
function.

b. Obtain $\exp(b_1)$, $\exp(b_2)$, and $\exp(b_3)$. Interpret these numbers.

c. What is the estimated probability that male clients aged 55 with a health awareness index
of 60 will receive a flu shot?

*14.15. Refer to Annual dues Problem 14.7. Assume that the fitted model is appropriate and that
large-sample inferences are applicable.

a. Obtain an approximate 90 percent confidence interval for $\exp(\beta_1)$. Interpret your interval.

b. Conduct a Wald test to determine whether dollar increase in dues ($X$) is related to the
probability of membership renewal; use $\alpha = .10$. State the alternatives, decision rule, and
conclusion. What is the approximate P-value of the test?

c. Conduct a likelihood ratio test to determine whether dollar increase in dues ($X$) is related
to the probability of membership renewal; use $\alpha = .10$. State the full and reduced models,
decision rule, and conclusion. What is the approximate P-value of the test? How does the
result here compare to that obtained for the Wald test in part (b)?

14.16. Refer to Performance ability Problem 14.9. Assume that the fitted model is appropriate and
that large-sample inferences are applicable.

a. Obtain an approximate 95 percent confidence interval for $\exp(\beta_1)$. Interpret your interval.

b. Conduct a Wald test to determine whether employee's emotional stability ($X$) is related
to the probability that the employee will be able to perform in a task group; use $\alpha = .05$.
State the alternatives, decision rule, and conclusion. What is the approximate P-value of
the test?

c. Conduct a likelihood ratio test to determine whether employee's emotional stability ($X$)
is related to the probability that the employee will be able to perform in a task group;
use $\alpha = .05$. State the full and reduced models, decision rule, and conclusion. What is the
approximate P-value of the test? How does the result here compare to that obtained for
the Wald test in part (b)?

*14.17. Refer to Bottle return Problem 14.11. Assume that the fitted model is appropriate and that
large-sample inferences are applicable.

a. Obtain an approximate 95 percent confidence interval for $\beta_1$. Convert this confidence
interval into one for the odds ratio. Interpret this latter interval.



b. Conduct a Wald test to determine whether deposit level ($X$) is related to the probability
that a bottle is returned; use $\alpha = .05$. State the alternatives, decision rule, and conclusion.
What is the approximate P-value of the test?

c. Conduct a likelihood ratio test to determine whether deposit level ($X$) is related to the
probability that a bottle is returned; use $\alpha = .05$. State the full and reduced models,
decision rule, and conclusion. What is the approximate P-value of the test? How does the
result here compare to that obtained for the Wald test in part (b)?

14.18. Refer to Toxicity experiment Problem 14.12. Assume that the fitted model is appropriate and
that large-sample inferences are applicable.

a. Obtain an approximate 99 percent confidence interval for $\beta_1$. Convert this confidence
interval into one for the odds ratio. Interpret this latter interval.

b. Conduct a Wald test to determine whether dose level ($X$) is related to the probability that
an insect dies; use $\alpha = .01$. State the alternatives, decision rule, and conclusion. What is
the approximate P-value of the test?

c. Conduct a likelihood ratio test to determine whether dose level ($X$) is related to the probability
that an insect dies; use $\alpha = .01$. State the full and reduced models, decision rule,
and conclusion. What is the approximate P-value of the test? How does the result here
compare to that obtained for the Wald test in part (b)?

14.19. Refer to Car purchase Problem 14.13. Assume that the fitted model is appropriate and that
large-sample inferences are applicable.

a. Obtain joint confidence intervals for the family income odds ratio $\exp(20\beta_1)$ for families
whose incomes differ by 20 thousand dollars and for the age of the oldest family automobile
odds ratio $\exp(2\beta_2)$ for families whose oldest automobiles differ in age by 2 years, with
family confidence coefficient of approximately .90. Interpret your intervals.

b. Use the Wald test to determine whether $X_2$, age of oldest family automobile, can be
dropped from the regression model; use $\alpha = .05$. State the alternatives, decision rule, and
conclusion. What is the approximate P-value of the test?

c. Use the likelihood ratio test to determine whether $X_2$, age of oldest family automobile, can
be dropped from the regression model; use $\alpha = .05$. State the full and reduced models,
decision rule, and conclusion. What is the approximate P-value of the test? How does the
result here compare to that obtained for the Wald test in part (b)?

d. Use the likelihood ratio test to determine whether the following three second-order terms,
the square of annual family income, the square of age of oldest automobile, and the two-factor
interaction effect between annual family income and age of oldest automobile,
should be added simultaneously to the regression model containing family income and age
of oldest automobile as first-order terms; use $\alpha = .05$. State the full and reduced models,
decision rule, and conclusion. What is the approximate P-value of the test?

*14.20. Refer to Flu shots Problem 14.14.

a. Obtain joint confidence intervals for the age odds ratio $\exp(30\beta_1)$ for male clients whose
ages differ by 30 years and for the health awareness index odds ratio $\exp(25\beta_2)$ for male
clients whose health awareness index differs by 25, with family confidence coefficient of
approximately .90. Interpret your intervals.

b. Use the Wald test to determine whether $X_3$, client gender, can be dropped from the regression
model; use $\alpha = .05$. State the alternatives, decision rule, and conclusion. What is the
approximate P-value of the test?

c. Use the likelihood ratio test to determine whether $X_3$, client gender, can be dropped from
the regression model; use $\alpha = .05$. State the full and reduced models, decision rule, and
conclusion. What is the approximate P-value of the test? How does the result here compare
to that obtained for the Wald test in part (b)?

d. Use the likelihood ratio test to determine whether the following three second-order terms,
the square of age, the square of health awareness index, and the two-factor interaction effect
between age and health awareness index, should be added simultaneously to the regression
model containing age and health awareness index as first-order terms; use $\alpha = .05$.
State the alternatives, full and reduced models, decision rule, and conclusion. What is the
approximate P-value of the test?


14.21. Refer to Car purchase Problem 14.13 where the pool of predictors consists of all first-order
terms and all second-order terms in annual family income and age of oldest family automobile.

a. Use forward selection to decide which predictor variables enter into the regression model.
Control the $\alpha$ risk at .10 at each stage. Which variables are entered into the regression model?

b. Use backward elimination to decide which predictor variables can be dropped from the
regression model. Control the $\alpha$ risk at .10 at each stage. Which variables are retained?
How does this compare to your results in part (a)?

c. Find the best model according to the $AIC_p$ criterion. How does this compare to your results
in parts (a) and (b)?

d. Find the best model according to the $SBC_p$ criterion. How does this compare to your results
in parts (a), (b) and (c)?
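
The two criteria in parts (c) and (d) can be computed from the maximized log-likelihood of each candidate model, using $AIC_p = -2\log_e L(\mathbf{b}) + 2p$ and $SBC_p = -2\log_e L(\mathbf{b}) + p\log_e n$. The following is a minimal sketch under that definition. The candidate model names and log-likelihood values are invented for illustration; only the sample size of 33 cases comes from the problem set.

```python
import math


def aic(loglik, p):
    """AIC_p = -2 log L(b) + 2p, where p counts all parameters incl. intercept."""
    return -2.0 * loglik + 2.0 * p


def sbc(loglik, p, n):
    """SBC_p = -2 log L(b) + p log(n)."""
    return -2.0 * loglik + p * math.log(n)


n = 33  # number of cases in the car purchase data
# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "X1": (-20.1, 2),
    "X1 + X2": (-18.0, 3),
    "X1 + X2 + X1*X2": (-17.8, 4),
}

best_by_aic = min(candidates, key=lambda k: aic(*candidates[k]))
best_by_sbc = min(candidates, key=lambda k: sbc(*candidates[k], n))
print("AIC choice:", best_by_aic)
print("SBC choice:", best_by_sbc)
```

Smaller values are better for both criteria; SBC penalizes extra parameters more heavily than AIC whenever $\log_e n > 2$.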


*14.22. Refer to Flu shots Problem 14.14 where the pool of predictors consists of all first-order terms
and all second-order terms in age and health awareness index.

a. Use forward selection to decide which predictor variables enter into the regression model.
Control the $\alpha$ risk at .10 at each stage. Which variables are entered into the regression model?

b. Use backward elimination to decide which predictor variables can be dropped from the
regression model. Control the $\alpha$ risk at .10 at each stage. Which variables are retained?
How does this compare to your results in part (a)?

c. Find the best model according to the $AIC_p$ criterion. How does this compare to your results
in parts (a) and (b)?

d. Find the best model according to the $SBC_p$ criterion. How does this compare to your results
in parts (a), (b) and (c)?

*14.23. Refer to Bottle return Problem 14.11. Use the groups given there to conduct a chi-square
goodness of fit test of the appropriateness of logistic regression model (14.20). Control the
risk of a Type I error at .01. State the alternatives, decision rule, and conclusion.

14.24. Refer to Toxicity experiment Problem 14.12. Use the groups given there to conduct a deviance
goodness of fit test of the appropriateness of logistic regression model (14.20). Control the
risk of a Type I error at .01. State the alternatives, decision rule, and conclusion.

*14.25. Refer to Annual dues Problem 14.7.

a. To assess the appropriateness of the logistic regression function, form three groups of
10 cases each according to their fitted logit values $\hat{\pi}'$. Plot the estimated proportions $p_j$
against the midpoints of the $\hat{\pi}'$ intervals. Is the plot consistent with a response function of
monotonic sigmoidal shape? Explain.

b. Obtain the studentized Pearson residuals (14.81) and plot them against the estimated model
probabilities with a lowess smooth superimposed. What does the plot suggest about the
adequacy of the fit of the logistic regression model?


14.26. Refer to Performance ability Problem 14.9.

a. To assess the appropriateness of the logistic regression function, form three groups of
nine cases each according to their fitted logit values $\hat{\pi}'$. Plot the estimated proportions $p_j$
against the midpoints of the $\hat{\pi}'$ intervals. Is the plot consistent with a response function of
monotonic sigmoidal shape? Explain.

b. Obtain the deviance residuals (14.83) and plot them against the estimated model probabilities
with a lowess smooth superimposed. What does the plot suggest about the adequacy
of the fit of the logistic regression model?

14.27. Refer to Car purchase Problems 14.13 and 14.21. 

a. To assess the appropriateness of the logistic regression model obtained in part (d) of 
Problem 14.21, form three groups of 11 cases each according to their fitted logit values 
$\hat{\pi}'$. Plot the estimated proportions $p_j$ against the midpoints of the $\hat{\pi}'$ intervals. Is the plot
consistent with a response function of monotonic sigmoidal shape? Explain. 

b. Obtain the studentized Pearson residuals (14.81) and plot them against the estimated model 
probabilities with a lowess smooth superimposed. What does the plot suggest about the 
adequacy of the fit of the logistic regression model? 

*14.28. Refer to Flu shots Problems 14.14 and 14.22.

a. To assess the appropriateness of the logistic regression model obtained in part (d) of
Problem 14.22, form 8 groups of approximately 20 cases each according to their fitted
logit values $\hat{\pi}'$. Plot the estimated proportions $p_j$ against the midpoints of the $\hat{\pi}'$
intervals. Is the plot consistent with a response function of monotonic sigmoidal shape?
Explain.

b. Using the groups formed in part (a), conduct a Hosmer-Lemeshow goodness of fit test for
the appropriateness of the logistic regression function; use $\alpha = .05$. State the alternatives,
decision rule, and conclusions. What is the P-value of the test?

c. Obtain the deviance residuals (14.83) and plot them against the estimated model probabilities
with a lowess smooth superimposed. What does the plot suggest about the adequacy
of the fit of the logistic regression model?
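
The grouping-based test in part (b) can be sketched as follows: sort the cases by fitted probability, split them into $c$ groups of roughly equal size, and accumulate $(O - E)^2/E$ over both outcome classes in every group. This is a minimal illustration that assumes the statistic is referred to a chi-square distribution with $c - 2$ degrees of freedom; the function name and the toy data are hypothetical.

```python
def hosmer_lemeshow(y, pi_hat, n_groups):
    """Return (chi-square statistic, degrees of freedom) for grouped 0/1 data.

    y: observed 0/1 responses; pi_hat: fitted probabilities for the same cases.
    """
    n = len(y)
    # Indices of the cases ordered by fitted probability.
    order = sorted(range(n), key=lambda i: pi_hat[i])
    stat = 0.0
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        obs1 = sum(y[i] for i in idx)        # observed number of 1s
        exp1 = sum(pi_hat[i] for i in idx)   # expected number of 1s
        obs0 = len(idx) - obs1               # observed number of 0s
        exp0 = len(idx) - exp1               # expected number of 0s
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return stat, n_groups - 2


# Toy data for illustration only.
y = [0, 0, 0, 1, 1, 1]
pi_hat = [0.25, 0.25, 0.50, 0.50, 0.75, 0.75]
stat, df = hosmer_lemeshow(y, pi_hat, n_groups=3)
print(stat, df)
```

Large values of the statistic relative to the chi-square reference distribution indicate that the logistic response function is not appropriate.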

*14.29. Refer to Annual dues Problem 14.7. 

a. For the logistic regression model fit in Problem 14.7a, prepare an index plot of the diagonal
elements of the estimated hat matrix (14.80). Use the plot to identify any outlying
X observations.

b. To assess the influence of individual observations, obtain the delta chi-square statistic
(14.85), the delta deviance statistic (14.86), and Cook's distance (14.87) for each observation.
Plot each of these in separate index plots and identify any influential observations.
Summarize your findings.

14.30. Refer to Performance ability Problem 14.9. 

a. For the logistic regression fit in Problem 14.9a, prepare an index plot of the diagonal 
elements of the estimated hat matrix (14.80). Use the plot to identify any outlying X 
observations. 

b. To assess the influence of individual observations, obtain the delta chi-square statistic 
(14.85), the delta deviance statistic (14.86), and Cook's distance (14.87) for each observation.
Plot each of these in separate index plots and identify any influential observations.
Summarize your findings. 

14.31. Refer to Car Purchase Problems 14.13 and 14.21. 

a. For the logistic regression model obtained in part (d) of Problem 14.21, prepare an index 
plot of the diagonal elements of the estimated hat matrix (14.80). Use the plot to identify 
any outlying X observations. 

b. To assess the influence of individual observations, obtain the delta chi-square statistic
(14.85), the delta deviance statistic (14.86), and Cook's distance (14.87) for each
observation. Plot each of these in separate index plots and identify any influential
observations. Summarize your findings.

14.32. Refer to Flu shots Problem 14.14.

a. For the logistic regression fit in Problem 14.14a, prepare an index plot of the diagonal
elements of the estimated hat matrix (14.80). Use the plot to identify any outlying X
observations.

b. To assess the influence of individual observations, obtain the delta chi-square statistic
(14.85), the delta deviance statistic (14.86), and Cook's distance (14.87) for each observation.
Plot each of these in separate index plots and identify any influential observations.
Summarize your findings.


*14.33. Refer to Annual dues Problem 14.7.

a. Based on the fitted regression function in Problem 14.7a, obtain an approximate 90 percent
confidence interval for the mean response $\pi_h$ for a dues increase of $X_h = \$40$.

b. A prediction rule is to be developed, based on the fitted regression function in Problem 14.7a.
Based on the sample cases, find the total error rate, the error rate for renewers,
and the error rate for nonrenewers for the following cutoffs: .40, .45, .50, .55, .60.

c. Based on your results in part (b), which cutoff minimizes the total error rate? Are the
error rates for renewers and nonrenewers fairly balanced at this cutoff? Obtain the area
under the ROC curve to assess the model's predictive power here. What do you
conclude?

d. How can you establish whether the observed total error rate for the best cutoff in part (b) is
a reliable indicator of the predictive ability of the fitted regression function and the chosen
cutoff?
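
The error rates in part (b) can be tabulated directly once fitted probabilities are available. The sketch below assumes the prediction rule "predict 1 if the fitted probability is at least the cutoff, otherwise predict 0"; the function name and the small data set are hypothetical and only show the bookkeeping.

```python
def error_rates(y, pi_hat, cutoff):
    """Total error rate and the error rates within the Y = 1 and Y = 0 cases."""
    # Assumed rule: predict 1 when the fitted probability reaches the cutoff.
    pred = [1 if p >= cutoff else 0 for p in pi_hat]
    n1 = sum(y)
    n0 = len(y) - n1
    miss1 = sum(1 for yi, pi in zip(y, pred) if yi == 1 and pi == 0)
    miss0 = sum(1 for yi, pi in zip(y, pred) if yi == 0 and pi == 1)
    return {
        "total": (miss1 + miss0) / len(y),
        "y1": miss1 / n1,  # Y = 1 cases predicted as 0
        "y0": miss0 / n0,  # Y = 0 cases predicted as 1
    }


# Hypothetical responses and fitted probabilities.
y = [1, 1, 0, 0]
pi_hat = [0.90, 0.42, 0.48, 0.10]
for cutoff in (0.40, 0.45, 0.50, 0.55, 0.60):
    print(cutoff, error_rates(y, pi_hat, cutoff))
```

Because these rates are computed on the same cases used to fit the model, they tend to be optimistic, which is the concern raised in part (d).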


14.34. Refer to Performance ability Problem 14.9.

a. Using the fitted regression function in Problem 14.9a, obtain joint confidence intervals for
the mean response $\pi_h$ for persons with emotional stability test scores $X_h = 550$ and 625,
respectively, with an approximate 90 percent family confidence coefficient. Interpret your
intervals.

b. A prediction rule, based on the fitted regression function in Problem 14.9a, is to be developed.
For the sample cases, find the total error rate, the error rate for employees able
to perform in a task group, and the error rate for employees not able to perform for the
following cutoffs: .325, .425, .525, .625.

c. On the basis of your results in part (b), which cutoff minimizes the total error rate? Are
the error rates for employees able to perform in a task group and for employees not able to
perform fairly balanced at this cutoff? Obtain the area under the ROC curve to assess the
model's predictive power here. What do you conclude?

d. How can you establish whether the observed total error rate for the best cutoff in part (c) is
a reliable indicator of the predictive ability of the fitted regression function and the chosen
cutoff?


14.35. Refer to Bottle return Problem 14.11.

a. For the fitted regression function in Problem 14.11a, obtain an approximate 95 percent
confidence interval for the probability of a purchase for deposit $X_h = 15$ cents. Interpret
your interval.

b. A prediction rule is to be developed based on the fitted regression function in Problem 14.11a.
For the sample cases, find the total error rate, the error rate for purchasers, and
the error rate for nonpurchasers for the following cutoffs: .150, .300, .450, .600, .750.

c. According to your results in part (b), which cutoff minimizes the total error rate? Are
the error rates for purchasers and nonpurchasers fairly balanced at this cutoff? Obtain 
the area under the ROC curve to assess the model’s predictive power here. What do you 
conclude? 

d. How can you establish whether the observed total error rate for the best cutoff in part (c) is 
a reliable indicator of the predictive ability of the fitted regression function and the chosen 
cutoff? 
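
The area under the ROC curve requested in part (c) equals the proportion of (Y = 1, Y = 0) pairs in which the Y = 1 case has the larger fitted probability, with ties counted as one half. The following sketch uses that pair-counting definition; the function name and data are hypothetical.

```python
def roc_auc(y, pi_hat):
    """Area under the ROC curve by comparing every Y=1 case with every Y=0 case."""
    pos = [p for yi, p in zip(y, pi_hat) if yi == 1]
    neg = [p for yi, p in zip(y, pi_hat) if yi == 0]
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0   # concordant pair
            elif a == b:
                wins += 0.5   # tied pair counts one half
    return wins / (len(pos) * len(neg))


# Hypothetical responses and fitted probabilities.
y = [1, 1, 0, 0]
pi_hat = [0.90, 0.42, 0.48, 0.10]
print(roc_auc(y, pi_hat))
```

A value near .5 indicates little ability to discriminate between the two outcome groups, while a value near 1 indicates strong discrimination.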

*14.36. Refer to Flu shots Problem 14.14. 

a. On the basis of the fitted regression function in Problem 14.14a, obtain a confidence 
interval for the mean response $\pi_h$ for a female whose age is 65 and whose health awareness
index is 50, with an approximate 90 percent family confidence coefficient. Interpret your 
intervals. 

b. A prediction rule is to be based on the fitted regression function in Problem 14.14a. For 
the sample cases, find the total error rate, the error rate for clients receiving the flu shot,
and the error rate for clients not receiving the flu shot for the following cutoffs: .05, .10, 
.15, .20. 

c. Based on your results in part (b), which cutoff minimizes the total error rate? Are the error 
rates for clients receiving the flu shot and for clients not receiving the flu shot fairly balanced 
at this cutoff? Obtain the area under the ROC curve to assess the model's predictive power 
here. What do you conclude? 

d. How can you establish whether the observed total error rate for the best cutoff in part (c) is
a reliable indicator of the predictive ability of the fitted regression function and the chosen 
cutoff? 

14.37. Polytomous logistic regression extends the binary response outcome to a multicategory
response outcome for either nominal level or ordinal level data. Discuss the advantages and
disadvantages of treating multicategory ordinal level outcomes as a series of binary logistic
regression models, as a nominal level polytomous regression model, or as a proportional odds
model.

*14.38. Refer to Airfreight breakage Problem 1.21.

a. Fit the Poisson regression model (14.113) with the response function
$\mu(X, \boldsymbol{\beta}) = \exp(\beta_0 + \beta_1 X)$. State the estimated regression coefficients, their estimated
standard deviations, and the estimated response function.

b. Obtain the deviance residuals and present them in an index plot. Do there appear to be any 
outlying cases? 

c. Estimate the mean number of ampules broken when X = 0,1,2,3. Compare these estimates 
with those obtained by means of the fitted linear regression function in Problem 1.21a. 

d. Plot the Poisson and linear regression functions, together with the data. Which regression 
function appears to be a better fit here? Discuss. 

e. Management wishes to estimate the probability that 10 or fewer ampules are broken when
there is no transfer of the shipment. Use the fitted Poisson regression function to obtain
this estimate.

f. Obtain an approximate 95 percent confidence interval for $\beta_1$. Interpret your interval
estimate.
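
For part (e), the estimated mean at $X = 0$ is $\exp(b_0)$, and the requested probability is the Poisson cumulative probability $P(Y \le 10)$ at that mean. The sketch below shows the calculation; the coefficient values are hypothetical stand-ins, not the fitted values for the airfreight data.

```python
import math


def poisson_cdf(k, mu):
    """P(Y <= k) for a Poisson random variable with mean mu."""
    term = math.exp(-mu)  # P(Y = 0)
    total = term
    for y in range(1, k + 1):
        term *= mu / y    # P(Y = y) from P(Y = y - 1)
        total += term
    return total


# Hypothetical fitted coefficients, for illustration only.
b0, b1 = 2.3, 0.25
mu_at_zero = math.exp(b0 + b1 * 0)  # estimated mean when X = 0
print(poisson_cdf(10, mu_at_zero))
```

The same function gives the estimated means needed in part (c) by evaluating `math.exp(b0 + b1 * x)` at $X = 0, 1, 2, 3$.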

14.39. Geriatric study. A researcher in geriatrics designed a prospective study to investigate the
effects of two interventions on the frequency of falls. One hundred subjects were randomly
assigned to one of the two interventions: education only ($X_1 = 0$) and education plus aerobic
exercise training ($X_1 = 1$). Subjects were at least 65 years of age and in reasonably good health.
Three variables considered to be important as control variables were gender ($X_2$: 0 = female;
1 = male), a balance index ($X_3$), and a strength index ($X_4$). The higher the balance index, the
more stable is the subject; and the higher the strength index, the stronger is the subject. Each
subject kept a diary recording the number of falls ($Y$) during the six months of the study. The
data follow:

Subject   Number of Falls   Intervention   Gender     Balance Index   Strength Index
  $i$         $Y_i$           $X_{i1}$     $X_{i2}$     $X_{i3}$         $X_{i4}$

    1           1                1            0            45              ?0
    2           1                1            0            62              66
    3           2                1            1            43              64
  ...         ...              ...          ...           ...             ...
   98           4                0            0            69              48
   99           4                0            1            50              52
  100           2                0            0            37              56

a. Fit the Poisson regression model (14.113) with the response function
$\mu(\mathbf{X}, \boldsymbol{\beta}) = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4)$. State the estimated regression coefficients, their
estimated standard deviations, and the estimated response function.

b. Obtain the deviance residuals and present them in an index plot. Do there appear to be any
outlying cases?

c. Assuming that the fitted model is appropriate, use the likelihood ratio test to determine
whether gender ($X_2$) can be dropped from the model; control $\alpha$ at .05. State the full and
reduced models, decision rule, and conclusion. What is the P-value of the test?

d. For the fitted model containing only $X_1$, $X_3$, and $X_4$ in first-order terms, obtain an approximate
95 percent confidence interval for $\beta_1$. Interpret your confidence interval. Does
aerobic exercise reduce the frequency of falls when controlling for balance and strength?


Exercises 


14.40. Show the equivalence of (14.16) and (14.17).

14.41. Derive (14.34) from (14.26).

14.42. Derive (14.18a), using (14.16) and (14.18).

14.43. (Calculus needed.) Maximum likelihood estimation theory states that the estimated large-sample
variance-covariance matrix for maximum likelihood estimators is given by the inverse
of the information matrix, the elements of which are the negatives of the expected values of the
second-order partial derivatives of the logarithm of the likelihood function evaluated at $\boldsymbol{\beta} = \mathbf{b}$:

$$\left[ -E\left\{ \frac{\partial^2 \log_e L(\boldsymbol{\beta})}{\partial \beta_i \, \partial \beta_j} \right\} \right]^{-1}_{\boldsymbol{\beta} = \mathbf{b}}$$

Show that this matrix simplifies to (14.51) for logistic regression. Consider the case where
$p - 1 = 1$.

14.44. (Calculus needed.) Estimate the approximate variance-covariance matrix of the estimated
regression coefficients for the programming task example in Table 14.1a, using (14.51), and
verify the estimated standard deviations in Table 14.1b.

14.45. Show that the logistic response function (13.10) reduces to the response function in (14.20)
when the $Y_i$ are independent Bernoulli random variables with $E\{Y_i\} = \pi_i$.

14.46. Consider the multiple logistic regression model with $\mathbf{X}'\boldsymbol{\beta} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$.
Derive an expression for the odds ratio for $X_1$. Does $\exp(\beta_1)$ have the same meaning here as
for a regression model containing no interaction term?






14.47. A Bernoulli response $Y_i$ has expected value:

$$E\{Y_i\} = \pi_i = 1 - \exp[\,\ldots\,]$$

Show that the link function here is the complementary log-log transformation of $\pi_i$, namely,
$\log_e[-\log_e(1 - \pi_i)]$.
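
A quick numerical check can make the link relationship concrete before attempting the algebra. The sketch below assumes a response function of the generic form $\pi = 1 - \exp[-\exp(\eta)]$, where $\eta$ stands for whatever predictor expression appears inside the inner exponential; the symbol $\eta$ and the function names are assumptions introduced here, not the text's notation.

```python
import math


def cloglog(pi):
    """Complementary log-log transformation log(-log(1 - pi)), for 0 < pi < 1."""
    return math.log(-math.log(1.0 - pi))


def inv_cloglog(eta):
    """Assumed response function pi = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


# Applying the transformation to the response function should return eta.
for eta in (-2.0, -0.5, 0.0, 0.5, 1.5):
    pi = inv_cloglog(eta)
    print(eta, round(pi, 4), round(cloglog(pi), 10))
```

Unlike the logit, this transformation is not symmetric about $\pi = .5$: the response function approaches 0 slowly and approaches 1 quickly.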


Projects 


14.48. Refer to the Disease outbreak data set in Appendix C.10. Savings account status is the 
response variable and age, socioeconomic status, and city sector are the predictor variables. 
Cases 1—98 are to be utilized for developing the logistic regression model. 

a. Fit logistic regression model (14.41) containing the predictor variables in first-order terms 
and interaction terms for all pairs of predictor variables. State the fitted response function. 

b. Use the likelihood ratio test to determine whether all interaction terms can be dropped
from the regression model; use $\alpha = .01$. State the alternatives, full and reduced models,
decision rule, and conclusion. What is the approximate P-value of the test?

c. For logistic regression model in part (a), use backward elimination to decide which predictor
variables can be dropped from the regression model. Control the $\alpha$ risk at .05 at each stage.
Which variables are retained in the regression model? 
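
The likelihood ratio test in part (b) compares the maximized log-likelihoods of the reduced and full models through $G^2 = -2[\log_e L(R) - \log_e L(F)]$, which is referred to a chi-square distribution with degrees of freedom equal to the number of parameters dropped. The sketch below is one way to carry out the arithmetic with only the standard library; the helper names and the log-likelihood values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail term plus a finite correction sum.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= half / (j - 0.5)
        total += math.exp(-half) * term
    return total


def likelihood_ratio_test(loglik_reduced, loglik_full, df, alpha=0.05):
    """Return (G2, p_value, reject_H0) for nested models."""
    g2 = -2.0 * (loglik_reduced - loglik_full)
    p_value = chi2_sf(g2, df)
    return g2, p_value, p_value < alpha


# Hypothetical log-likelihoods; df = number of terms dropped from the full model.
print(likelihood_ratio_test(-55.0, -50.0, df=3, alpha=0.01))
```

Rejecting when the P-value falls below $\alpha$ is equivalent to rejecting when $G^2$ exceeds the chi-square critical value for the same degrees of freedom.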

14.49. Refer to the Disease outbreak data set in Appendix C.10 and Project 14.48. Logistic regression
model (14.41) with predictor variables age and socioeconomic status in first-order terms is to 
be further evaluated. 

a. Conduct the Hosmer-Lemeshow goodness of fit test for the appropriateness of the logistic 
regression function by forming five groups of approximately 20 cases each; use $\alpha = .05$.
State the alternatives, decision rule, and conclusion. What is the approximate P-value of 
the test? 

b. Obtain the deviance residuals and plot them against the estimated probabilities with a 
lowess smooth superimposed. What does the plot suggest about the adequacy of the fit of 
the logistic regression model? 

c. Prepare an index plot of the diagonal elements of the estimated hat matrix (14.80). Use the 
plot to identify any outlying X observations. 

d. To assess the influence of individual observations, obtain the delta chi-square statistic 
(14.85), the delta deviance statistic (14.86), and Cook's distance (14.87) for each observation.
Plot each of these in separate index plots and identify any influential observations.
Summarize your findings. 

e. Construct a half-normal probability plot of the absolute deviance residuals and superimpose
a simulated envelope. Are any cases outlying? Does the logistic model appear to be a good 
fit? Discuss. 

f. To predict savings account status, you must identify the optimal cutoff. On the basis of the 
sample cases, find the total error rate, the error rate for persons with a savings account, 
and the error rate for persons with no savings account for the following cutoffs: .45, .50,
.55, .60. Which of the cutoffs minimizes the total error rate? Are the two error rates for 
persons with and without savings accounts fairly balanced at this cutoff? Obtain the area 
under the ROC curve to assess the model’s predictive power here. What do you conclude? 

14.50. Refer to the Disease outbreak data set in Appendix C.10 and Project 14.49. The regression 
model identified in Project 14.49 is to be validated using cases 99-196. 



a. Use the rule obtained in Project 14.49f to make a prediction for each of the holdout validation
cases. What are the total and the two component prediction error rates for the validation
data set? How do these error rates compare with those for the model-building data set in
Project 14.49f?

b. Combine the model-building and validation data sets and fit the model identified in
Project 14.49 to the combined data. Are the estimated coefficients and their estimated
standard deviations similar to those obtained for the model-building data set? Should they
be? Comment.

c. Based on the fitted regression model in part (b), obtain joint 90 percent confidence intervals
for the odds ratios for age and socioeconomic status. Interpret your intervals.

14.51. Refer to the SENIC data set in Appendix C.1. Medical school affiliation is the response
variable, to be coded $Y = 1$ if medical school affiliation and $Y = 0$ if no medical school
affiliation. The pool of potential predictor variables includes age, routine chest X-ray ratio,
average daily census, and number of nurses. All 113 cases are to be used in developing the
logistic regression model.

a. Fit logistic regression model (14.41) containing all predictor variables in the pool in first-order
terms and interaction terms for all pairs of predictor variables. State the fitted response
function.

b. Test whether all interaction terms can be dropped from the regression model; use $\alpha = .05$.
State the full and reduced models, decision rule, and conclusion. What is the approximate
P-value of the test?

c. For logistic regression model (14.41) containing the predictor variables in first-order terms
only, use forward stepwise regression to decide which predictor variables can be retained
in the regression model. Control the $\alpha$ risk at .10 at each stage. Which variables should be
retained in the regression model?

d. For logistic regression model (14.41) containing the predictor variables in first-order terms
only, identify the best subset models using the $AIC_p$ criterion and the $SBC_p$ criterion. Does
the use of these two criteria lead to the same model? Are either of the models identified
the same as that found in part (c)?


14.52. Refer to the SENIC data set in Appendix C.1 and Project 14.51. Logistic regression
model (14.41) with predictor variables age and average daily census in first-order terms is to
be further evaluated.

a. Conduct the Hosmer-Lemeshow goodness of fit test for the appropriateness of the logistic
regression function by forming five groups of approximately 23 cases each; use $\alpha = .05$. State
the alternatives, decision rule, and conclusion. What is the approximate P-value of the test?

b. Obtain the deviance residuals and plot them against the estimated probabilities with a
lowess smooth superimposed. What does the plot suggest about the adequacy of the fit of
the logistic regression model?

c. Construct a half-normal probability plot of the absolute deviance residuals and superimpose
a simulated envelope. Are any cases outlying? Does the logistic model appear to be
a good fit? Discuss.

d. Prepare an index plot of the diagonal elements of the estimated hat matrix (14.80). Use the
plot to identify any outlying X observations.

e. To assess the influence of individual observations, obtain the delta chi-square statistic
(14.85), the delta deviance statistic (14.86), and Cook's distance (14.87) for each observation.
Plot each of these in separate index plots and identify any influential observations.
Summarize your findings.





f. To predict medical school affiliation, you must identify the optimal cutoff. For the sample 
cases, find the total error rate, the error rate for hospitals with medical school affiliation, 
and the error rate for hospitals without medical school affiliation for the following cutoffs: 
.30, .40, .50, .60. Which of the cutoffs minimizes the total error rate? Are the two error 
rates for hospitals with and without medical school affiliation fairly balanced at this cutoff? 
Obtain the area under the ROC curve to assess the model’s predictive power here. What 
do you conclude? 

g. Estimate by means of an approximate 90 percent confidence interval the odds of a hospital 
having medical school affiliation for hospitals with average age of patients of 55 years and 
average daily census of 500 patients. 

14.53. Refer to Annual dues Problem 14.7. Obtain a simulated envelope and superimpose it on the 
half-normal probability plot of the absolute deviance residuals. Are there any indications that 
the fitted model is not appropriate? Are there any outlying cases? Discuss. 

14.54. Refer to Annual dues Problem 14.7. In order to assess the appropriateness of large-sample
inferences here, employ the following parametric bootstrap procedure: For each of the 30 cases,
generate a Bernoulli outcome (0, 1), using the estimated probability $\hat{\pi}_i$ for the original $X_i$
level according to the fitted model. Fit the logistic regression model to the bootstrap sample
and obtain the bootstrap estimates $b_0^*$ and $b_1^*$. Repeat this procedure 500 times. Compute
the mean and standard deviation of the 500 bootstrap estimates $b_0^*$, and do the same for $b_1^*$.
Plot separate histograms of the bootstrap distributions of $b_0^*$ and $b_1^*$. Are these distributions
approximately normal? Compare the point estimates $b_0$ and $b_1$ and their estimated standard
deviations obtained in the original fit to the means and standard deviations of the bootstrap
distributions. What do you conclude about the appropriateness of large-sample inferences
here? Discuss.
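
The parametric bootstrap described above can be sketched with the standard library alone. The code below is an illustration for a single predictor, not a prescribed implementation: the Newton-Raphson fitter, the function names, the predictor levels, and the coefficient values $b_0 = -5.0$, $b_1 = 0.125$ are all hypothetical stand-ins for the fitted model of Problem 14.7.

```python
import math
import random
import statistics


def fit_simple_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            eta = max(-30.0, min(30.0, b0 + b1 * xi))  # guard against overflow
            pi = 1.0 / (1.0 + math.exp(-eta))
            w = pi * (1.0 - pi)
            g0 += yi - pi
            g1 += (yi - pi) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break  # information matrix nearly singular (e.g. separated sample)
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def parametric_bootstrap(x, b0, b1, n_boot=500, seed=0):
    """Simulate Bernoulli responses from the fitted model and refit each sample."""
    rng = random.Random(seed)
    pi_hat = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    estimates = []
    for _ in range(n_boot):
        y_star = [1 if rng.random() < p else 0 for p in pi_hat]
        estimates.append(fit_simple_logistic(x, y_star))
    return estimates


if __name__ == "__main__":
    x = [30 + 0.7 * i for i in range(30)]  # hypothetical predictor levels
    boot = parametric_bootstrap(x, b0=-5.0, b1=0.125, n_boot=500)
    for name, values in (("b0*", [b[0] for b in boot]), ("b1*", [b[1] for b in boot])):
        print(name, statistics.mean(values), statistics.stdev(values))
```

A bootstrap sample in which the outcomes are completely separated has no finite maximum likelihood estimate; this sketch simply stops iterating in that case, so such replicates should be inspected before the means and standard deviations are interpreted.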

14.55. Refer to Car purchase Problem 14.13. Obtain a simulated envelope and superimpose it on 
the half-normal probability plot of the absolute deviance residuals. Are there any indications 
that the fitted model is not appropriate? Are there any outlying cases? Discuss. 

14.56. Refer to Car purchase Problem 14.13. In order to assess the appropriateness of large-sample
inferences here, employ the following parametric bootstrapping procedure: For each of the
33 cases, generate a Bernoulli outcome (0, 1), using the estimated probability for the original
levels of the predictor variables according to the fitted model. Fit the logistic regression model
to the bootstrap sample. Repeat this procedure 500 times. Compute the mean and standard
deviation of the 500 bootstrap estimates $b_1^*$ and do the same for $b_2^*$. Plot separate histograms
of the bootstrap distributions of $b_1^*$ and $b_2^*$. Are these distributions approximately normal?
Compare the point estimates $b_1$ and $b_2$ and their estimated standard deviations obtained in the
original fit to the means and standard deviations of the bootstrap distributions. What do you
conclude about the appropriateness of large-sample inferences here? Discuss.

14.57. Refer to the SENIC data set in Appendix C.1. Region is the nominal level response variable
coded 1 = NE, 2 = NC, 3 = S, and 4 = W. The pool of potential predictor variables includes age,
routine chest X-ray ratio, number of beds, medical school affiliation, average daily census, 
number of nurses, and available facilities and services. All 113 hospitals are to be used in 
developing the polytomous logistic regression model. 

a. Fit polytomous regression model (14.99) using response variable region with 1 = NE as 
the referent category. Which predictors appear to be most important? Interpret the results. 

b. Conduct a likelihood ratio test to determine if the three parameters corresponding to age 
can be dropped from the nominal logistic regression model. Control $\alpha$ at .05. State the full
and reduced models, decision rule, and conclusion. What is the approximate P-value of 
the test? 



c. Conduct a likelihood ratio test to determine if all parameters corresponding to age and
available facilities and services can be dropped from the nominal logistic regression model.
Control $\alpha$ at .05. State the full and reduced models, decision rule, and conclusion. What is
the approximate P-value of the test?

d. For the full model in part (a), carry out separate binary logistic regressions for each of the
three comparisons with the referent category, as described at the top of page 612. How do
the slope coefficients compare to those obtained in part (a)?

e. For each of the separate binary logistic regressions carried out in part (d), obtain the
deviance residuals and plot them against the estimated probabilities with a lowess smooth
superimposed. What do the plots suggest about the adequacy of the fit of the binary logistic
regression models?

f. For each of the separate binary logistic regressions carried out in part (d), obtain the delta
chi-square statistic (14.85), the delta deviance statistic (14.86), and Cook's distance (14.87)
for each observation. Plot each of these in separate index plots and identify any influential
observations. Summarize your findings.

14.58. Refer to the CDI data set in Appendix C.2. Region is the nominal level response variable
coded 1 = NE, 2 = NC, 3 = S, and 4 = W. The pool of potential predictor variables includes
population density (total population/land area), percent of population aged 18-34, percent of
population aged 65 or older, serious crimes per capita (total serious crimes/total population),
percent high school graduates, percent bachelor's degrees, percent below poverty level, percent
unemployment, and per capita income. The even-numbered cases are to be used in developing
the polytomous logistic regression model.

a. Fit polytomous regression model (14.99) using response variable region with 1 = NE
as the referent category. Which predictors appear to be most important? Interpret the
results.

b. Conduct a series of likelihood ratio tests to determine which predictors, if any, can be
dropped from the nominal logistic regression model. Control $\alpha$ at .01 for each test. State
the alternatives, decision rules, and conclusions.

c. For the full model in part (a), carry out separate binary logistic regressions for each of the
three comparisons with the referent category, as described at the top of page 612. How do
the slope coefficients compare to those obtained in part (a)?

d. For each of the separate binary logistic regressions carried out in part (c), obtain the
deviance residuals and plot them against the estimated probabilities with a lowess smooth
superimposed. What do the plots suggest about the adequacy of the fit of the binary logistic
regression models?

e. For each of the separate binary logistic regressions carried out in part (d), obtain the delta
chi-square statistic (14.85), the delta deviance statistic (14.86), and Cook's distance (14.87)
for each observation. Plot each of these in separate index plots and identify any influential
observations. Summarize your findings.


14.59. Refer to the Prostate cancer data set in Appendix C.5. Gleason score (variable 9) is the
ordinal level response variable, and the pool of potential predictor variables includes PSA
level, cancer volume, weight, age, benign prostatic hyperplasia, seminal vesicle invasion, and
capsular penetration (variables 2 through 8).

a. Fit the proportional odds model (14.105). Which predictors appear to be most important?
Interpret the results.

b. Conduct a series of Wald tests to determine which predictors, if any, can be dropped from
the nominal logistic regression model. Control $\alpha$ at .05 for each test. State the alternatives,
decision rule, and conclusion. What is the approximate P-value of the test?





c. Starring with the full model of part (a), use backward elimination to decide which predictor 
vari^>les can be dropped from the ordinal regression model. Control the or risk at .05 at 
each stage. Which variables should be retained? 

d. For the model in part (c), carry out separate binary logistic regressions for each of the two 
binary variables $Y_i^{(1)}$ and $Y_i^{(2)}$, as described at the top of page 617. How do the estimated 
coefficients compare to those obtained in part (c)? 

e. For each of the separate binary logistic regressions carried out in part (d), obtain the 
deviance residuals and plot them against the estimated probabilities with a lowess smooth 
superimposed. What do the plots suggest about the adequacy of the fit of the binary logistic 
regression models? 

f. For each of the separate binary logistic regressions carried out in part (d), obtain the delta 
chi-square statistic (14.85), the delta deviance statistic (14.86), and Cook’s distance (14.87) 
for each observation. Plot each of these in separate index plots and identify any influential 
observations. Summarize your findings. 
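
The Wald tests requested in part (b) can be sketched as follows. This is a minimal illustration, assuming the usual large-sample form of the test: the statistic is $z^* = b_k / s\{b_k\}$, and $H_0\colon \beta_k = 0$ is rejected when $|z^*| > z(1 - \alpha/2)$. The function name `wald_test` and the numbers in the usage lines are hypothetical; they are not estimates from the Prostate cancer data set.

```python
from statistics import NormalDist


def wald_test(estimate, std_error, alpha=0.05):
    """Two-sided Wald test of H0: beta_k = 0 against Ha: beta_k != 0.

    estimate  -- fitted coefficient b_k (supplied by your fitting software)
    std_error -- its estimated standard deviation s{b_k}
    Returns (z_star, critical_value, p_value, reject_h0).
    """
    std_normal = NormalDist()                       # standard normal, mean 0, sd 1
    z_star = estimate / std_error                   # Wald test statistic
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std_normal.cdf(abs(z_star)))  # two-sided P-value
    reject_h0 = abs(z_star) > critical_value
    return z_star, critical_value, p_value, reject_h0


# Hypothetical inputs, for illustration only (not results from any data set).
z_star, critical_value, p_value, reject_h0 = wald_test(0.5, 0.25, alpha=0.05)
print(z_star, critical_value, p_value, reject_h0)
```

A predictor whose test does not reject $H_0$ is a candidate for being dropped; the estimates and standard errors themselves must come from the fitted proportional odds model.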

14.60. Refer to the Real estate sales data set in Appendix C.7. Quality of construction (variable 10) 
is the ordinal level response variable, and the pool of potential predictor variables includes 
sales price, finished square feet, number of bedrooms, number of bathrooms, air conditioning, 
garage size, pool, year built, lot size, and adjacent to highway (variables 2 through 9 and 12 
through 13). 

a. Fit the proportional odds model (14.105). Which predictors appear to be most important? 
Interpret the results. 

b. Conduct a series of Wald tests to determine which predictors, if any, can be dropped from 
the nominal logistic regression model. Control $\alpha$ at .01 for each test. State the alternatives, 
decision rules, and conclusions. Which predictors should be retained? 

c. Starting with the full model of part (a), use backward elimination to decide which predictor 
variables can be dropped from the ordinal regression model. Control the $\alpha$ risk at .05 at 
each stage. Which variables should be retained? 

d. For the model obtained in part (c), carry out separate binary logistic regressions for each 
of the two binary variables $Y_i^{(1)}$ and $Y_i^{(2)}$, as described at the top of page 617. How do the 
estimated coefficients compare to those obtained in part (a)? 

e. For each of the separate binary logistic regressions carried out in part (d), obtain the 
deviance residuals and plot them against the estimated probabilities with a lowess smooth 
superimposed. What do the plots suggest about the adequacy of the fit of the binary logistic 
regression models? 

f. For each of the separate binary logistic regressions carried out in part (d), obtain the delta 
chi-square statistic (14.85), the delta deviance statistic (14.86), and Cook’s distance (14.87) 
for each observation. Plot each of these in separate index plots and identify any influential 
observations. Summarize your findings. 
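
The diagnostics in parts (e) and (f) can be computed case by case once a binary logistic fit is available. The sketch below assumes the standard definitions: the Pearson residual $r_{P_i} = (Y_i - \hat{\pi}_i)/\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}$, the studentized Pearson residual $r_{SP_i} = r_{P_i}/\sqrt{1 - h_{ii}}$, the deviance residual $\mathrm{dev}_i = \mathrm{sign}(Y_i - \hat{\pi}_i)\sqrt{-2[Y_i \ln \hat{\pi}_i + (1 - Y_i)\ln(1 - \hat{\pi}_i)]}$, and then $\Delta X_i^2 = r_{SP_i}^2$, $\Delta \mathrm{dev}_i = h_{ii} r_{SP_i}^2 + \mathrm{dev}_i^2$, and $D_i = r_{P_i}^2 h_{ii} / [p(1 - h_{ii})^2]$. Check these against (14.85) through (14.87) before relying on them. The function names are hypothetical, and the fitted probabilities and leverages $h_{ii}$ are assumed to come from your own fitting software.

```python
import math


def pearson_residual(y, pi_hat):
    """Ordinary Pearson residual for a binary response y in {0, 1}."""
    return (y - pi_hat) / math.sqrt(pi_hat * (1 - pi_hat))


def deviance_residual(y, pi_hat):
    """Signed square root of the case's contribution to the model deviance."""
    log_lik = y * math.log(pi_hat) + (1 - y) * math.log(1 - pi_hat)
    return math.copysign(math.sqrt(-2 * log_lik), y - pi_hat)


def influence_measures(y, pi_hat, leverage, n_params):
    """Return (delta chi-square, delta deviance, Cook's distance) for one case.

    leverage -- diagonal element h_ii of the hat matrix for this case
    n_params -- number of regression parameters p in the fitted model
    """
    r_p = pearson_residual(y, pi_hat)
    r_sp = r_p / math.sqrt(1 - leverage)            # studentized Pearson residual
    dev = deviance_residual(y, pi_hat)
    delta_chi_square = r_sp ** 2
    delta_deviance = leverage * r_sp ** 2 + dev ** 2
    cooks_distance = r_p ** 2 * leverage / (n_params * (1 - leverage) ** 2)
    return delta_chi_square, delta_deviance, cooks_distance


# Hypothetical single case, for illustration only.
print(deviance_residual(1, 0.5))
print(influence_measures(1, 0.5, leverage=0.2, n_params=2))
```

Plotting each returned quantity against the case index gives the index plots requested in part (f); the lowess smooth in part (e) requires a smoothing routine that is not shown here.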

14.61. Refer to the Ischemic heart disease data set in Appendix C.9. The response is the number 
of emergency room visits (variable 7) and the pool of potential predictor variables includes 
total cost, age, gender, number of interventions, number of drugs, number of complications, 
number of comorbidities, and duration (variables 2 through 6 and 8 through 10). 

a. Obtain the fitted Poisson regression model (14.113) with the response function 
$\mu(\mathbf{X}, \boldsymbol{\beta}) = \exp(\mathbf{X}'\boldsymbol{\beta})$. State the estimated regression coefficients, their estimated standard 
deviations, and the estimated response function. 

b. Obtain the deviance residuals (14.118) and plot them against the estimated model 
probabilities with a lowess smooth superimposed. What does the plot suggest about the adequacy 
of the fit of the Poisson regression model? 



c. Conduct a series of Wald tests to determine which predictors, if any, can be dropped from 
the nominal logistic regression model. Control $\alpha$ at .01 for each test. State the alternatives, 
decision rules, and conclusions. 

d. Assuming that the fitted model in part (a) is appropriate, use the likelihood ratio test to 
determine whether duration, complications, and comorbidities can be dropped from the 
model; control $\alpha$ at .05. State the full and reduced models, decision rule, and conclusion. 

e. Use backward elimination to decide which predictor variables can be dropped from the 
regression model. Control the $\alpha$ risk at .10 at each stage. Which variables are retained? 

Case Studies 


14.62. Refer to the IPO data set in Appendix C.11. Carry out a complete analysis of this data set, where 
the response of interest is venture capital funding, and the pool of predictors includes face 
value of the company, number of shares offered, and whether or not the company underwent 
a leveraged buyout. The analysis should consider transformations of predictors, inclusion of 
second-order predictors, analysis of residuals and influential observations, model selection, 
goodness of fit evaluation, and the development of an ROC curve. Model validation should 
also be employed. Document the steps taken in your analysis, and assess the strengths and 
weaknesses of your final model. 

14.63. Refer to the Real estate sales data set in Appendix C.7. Create a new binary response 
variable Y, called high quality construction, by letting $Y = 1$ if quality (variable 10) equals 1, and 
$Y = 0$ otherwise (i.e., if quality equals 2 or 3). Carry out a complete logistic regression 
analysis, where the response of interest is high quality construction (Y), and the pool of predictors 
includes sales price, finished square feet, number of bedrooms, number of bathrooms, air 
conditioning, garage size, pool, year built, style, lot size, and adjacent to highway (variables 2 
through 9 and 11 through 13). The analysis should consider transformations of predictors, 
inclusion of second-order predictors, analysis of residuals and influential observations, model 
selection, goodness of fit evaluation, and the development of an ROC curve. Develop a 
prediction rule for determining whether the quality of construction is predicted to be of high quality 
or not. Model validation should also be employed. Document the steps taken in your analysis, 
and assess the strengths and weaknesses of your final model. 

14.64. Refer to the Prostate cancer data set in Appendix C.5. Create a new binary response 
variable Y, called high-grade cancer, by letting $Y = 1$ if Gleason score (variable 9) equals 8, and 
$Y = 0$ otherwise (i.e., if Gleason score equals 6 or 7). Carry out a complete logistic regression 
analysis, where the response of interest is high-grade cancer (Y), and the pool of 
predictors includes PSA level, cancer volume, weight, age, benign prostatic hyperplasia, seminal 
vesicle invasion, and capsular penetration (variables 2 through 8). The analysis should 
consider transformations of predictors, inclusion of second-order predictors, analysis of residuals 
and influential observations, model selection, goodness of fit evaluation, and the development 
of an ROC curve. Develop a prediction rule for determining whether the grade of disease is 
predicted to be high grade or not. Model validation should also be employed. Document the 
steps taken in your analysis, and assess the strengths and weaknesses of your final model. 
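
The ROC curve and prediction rule called for in these case studies can be sketched as follows. This is one possible approach, assuming fitted probabilities $\hat{\pi}_i$ are already available from a logistic fit and that case $i$ is predicted to be a 1 when $\hat{\pi}_i$ is at or above a cutoff. The function names `roc_points` and `area_under_curve` and the six-case data are hypothetical, chosen only to make the sketch runnable.

```python
def roc_points(y, pi_hat):
    """Return (cutoff, 1 - specificity, sensitivity) for each distinct cutoff.

    y      -- observed binary responses (0 or 1)
    pi_hat -- fitted probabilities from a logistic regression model
    """
    n_ones = sum(y)
    n_zeros = len(y) - n_ones
    points = []
    for cutoff in sorted(set(pi_hat), reverse=True):
        # Prediction rule: predict 1 when the fitted probability is >= cutoff.
        predicted = [1 if p >= cutoff else 0 for p in pi_hat]
        true_pos = sum(1 for yi, pr in zip(y, predicted) if yi == 1 and pr == 1)
        false_pos = sum(1 for yi, pr in zip(y, predicted) if yi == 0 and pr == 1)
        points.append((cutoff, false_pos / n_zeros, true_pos / n_ones))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve, starting from the origin."""
    xs = [0.0] + [fpr for _, fpr, _ in points]
    ys = [0.0] + [tpr for _, _, tpr in points]
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))


# Hypothetical responses and fitted probabilities, for illustration only.
y_obs = [1, 1, 0, 1, 0, 0]
pi_fit = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for cutoff, one_minus_spec, sens in roc_points(y_obs, pi_fit):
    print(cutoff, one_minus_spec, sens)
print(area_under_curve(roc_points(y_obs, pi_fit)))
```

Plotting sensitivity against 1 minus specificity traces the ROC curve. A cutoff can then be chosen from the listed points, and for model validation the same cutoff should be applied to cases that were not used in fitting.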



